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Executive  Summary  ‘  1 

Parallel  Processor  Architecture 

''  The  primary  focus  of  our  architecture  effort  has  been  on  completing  the 
design  of  DASH  and  starting  the  construction  of  the  prototype' 'jThe  design  of 
the  prototype  is  now  complete  and  frozen.  We  have  already^egun  making 
the  necessary  modifications  to  existing  SGI  4D/240  boards.  THe  logic  design 
of  the  directory  board  (the  key  component  of  DASH)  is  now 'complete  and 
entered  into  a  VALID  CAE  system.  It  is  currently  being  simulated.  The 
network  chips  are  back  from  MOSIS.  We  have  tested  them  and  they  are 
functional.  In  cooperation  with  Silicon  Graphics,  we  have  also  begun  work  on 
the  operating  system  for  DASH.  Our  overall  goal  is  to  have  the  hardware  for 
the  prototype  put  together  in  the  next  4  months,  when  we  can  proceed  with 
initial  hardware  and  software  debug. 


We  have  also  been  working  on  novel  techniques  to  reduce  the  invalidation 
traffic  and  directory-memory  requiremerxts'For  large-scale  multiprocessors. 
We  have  come  up  with  two  techni^es'ibr  lowering  invalidation  traffic  and 
directory  memory  requiremer^ -The  first  f/ichnique  is  based  on  the  notion  of 
coarse  vectors.  It  is  a  Ipawt^-pointer  directory  scheme,  but  uses  a  novel 
method  for  handliug-tiverflows.  The  second  technique  abandons  the  idea  of 
having  one  directory  entry  per  memory  block  and  organizes  the  directory 
memo^,-as''k  cache.  This  results  in  significant  savings  in  the  total  directory 
memory  needed.  Performance  results  obtained  so  far  have  been  very 
en/wuraging. 

We  have  continued  the  development  and  use  of  our  software  tracing  and 
/simulation  system.  Tango.  Tango  has  now  been  used  to  simulate  applications 
running  on  over  100  processors.  Performance  evaluations  have  confirmed 
that  Tango  gains  significant  efficiency  by  allowing  the  user  to  specify  both  the 
information  to  be  traced,  and  a  memory  system  model  with  an  appropriate 
level  of  abstraction. 


..  /  Parallel  Software  — 

We  are  building  a  compiler  system  that  supports  our  riisearch  in  a  broad 
spectrum  of  topics,  ranging  from  advanced  scalar  optimizations  to  parallel 
loop  transformations  and  code  scheduling  for  multiprocessors  and  superscalar 
processorsr-The  synerg;;?  of  the  different  research  efforts  on  the  same  system 
makes  possible  a  proper  evaluation  of  individual  compiler  optimizations  as 

well  as  the  system sa,a  whole.  ^ 

- - - - - ^ 

We  have  developed  and  began  implementing  a  loop  restructuring  algorithW 
to  improve  the  data  locality  of  code  for  both  uniprocessor  and  multiprocessor 
systems.  Our  approach  uses  a  novel  unified  model  of  loop  transformations. 


An  ongoing  area  of  exploration  is  the  interaction  between  parallel 
applications  and  architectures.  Recently,  an  experimental  study  of  the  cache 
behavior  of  explicitly  parallel  code  has  shown  that  sharing  is  a  dominant 
factor  in  multiprocessor  performance.  This  study  has  led  to  the  development 
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of  optimizations  that  reposition  shared  data  to  reduce  sharing  induced  by  the 
collocation  of  data  in  a  cache  block. 


Work  continues  on  investigating  approaches  for  compiling  symbolic 
languages  (Scheme)  to  use  parallel  architectures  efficiently. 


J  i  Uniprocessor  Architecture 

^  We  havv  i  continued  our  work  on  investigating  efficient  methods  of  extracting 
the  parallelism  available  at  the  lowest-level,  by  issuing  multiple  instruction 
per  cycle'*) Our  current  work  looks  at  a  combined  hardware-software 
approach^here  the  compiler  is  responsible  for  reordering  the  instructions, 
IpoBBihly 'boosting  an  instruction  through  a  branch,  and  the  hardware  is 
resp^sible  for  not  committing  the  results  of  these  “boosted”  instructipns 
iHitil  the  branch  that  they  depend  on  completes.  The  added  hardware 
^^required  in  this  system  is  small,  yet  the  performance  of  this  system  is 
comparable  to  a  fully  dynamically  reordered  machine  of  much^4reater 
complexity. 

Computer-Aided  Design  — '  / 

Recent  research  has  been  focusing  on  partitioning  techniques  at  the 
behavioral-level  of  abstraction  Partitioning  is  an  important ^ep  in  synthesis 
ofllgrgErdigital.^ystems.  An  effgictive  partitioning  of  the  di^al  hardware  can 
/^ad  to  simplifiedThteTCnUfi^tion  structure  and  efficipht  use  of  hardware 
'  resoorees.  Partitioning  is  performed  to  explore  area/pferformance  tradeoffs, 
inarease'  concurrency  in  hardware  by  creating  multiple  processes  and  possibly 
decrease  complexity  of  control.  Further,  multr^'pipestages  can  be  created 
from  functionally  decomposed  processes.^  Unlike  previous  approaches, 
partitioning  at  hi-level  is  carried  out  bpfbfe  any  of  the  traditional  synthesis 
tasks  such  as  scheduling.  It  is  beliereothat  such  an  approach  would  lead  to 
greater  flexibility  in  area,  and' performance  tradeoffs  than  would  be  possible 
by  partitioning  Ile^^ist^^at  the  logic  level.  Also  since  there  are  fewer  objects  at 

- hklevel  dHBcfiptions,  it  is  possible  to  do  effective  area/performance  tradeoffs 

in  less  time. 

We  are  investigating  techniques  for  verifying  asynchronous,  concurrent 
designs  ranging  from  single  VLSI  chips  to  entire  systems,  such  as  DASH. 
Our  approach  focuses  on  automated  verification,  and  has  already 
demonstrated  some  important  successes. 


Research  on  parallel  CAD  applications  serves  to  drive  our  multiprocessor 
architecture  research  and  to  provide  more  powerful  tools  for  VLSI  designers. 
An  initial  system  for  logic  simulation  has  been  quite  successful,  and  a  new 
system  for  parallel  multilevel  simulation  is  being  developed. 

Our  work  in  analyzing  the  power  and  ground  noise  in  integrated  circuits  is 
progressing  nicely.  We  have  now  used  the  system  on  a  number  of  large 
CMOS  and  ECL  circuits,  and  have  been  working  on  checking  the  errors  that 
are  introduced  by  various  simplifications  done  to  increase  the  efficiency  of  the 
program.  In  the  process  of  checking  these  errors,  we  wrote  a  very  efficient 
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accurate  resistance  extractor,  improving  on  the  work  done  by  Steve 
McCormick  at  MIT,  and  looked  closely  at  modeling  the  effect  of  the  substrate 
current.  The  latter  is  difficult,  since  it  involves  determining  where  the 
current  from  the  bottom-plate  of  the  wire  capacitors  is  collected  by  the 
substrate  and  injected  into  the  power  supply  net.  Many  systems  ignore  this 
cun*ent  entirely,  even  though  it  is  equal  to  the  current  that  flow  through  the 
driving  transistor. 


VLSI  Design 

During  this  period  we  have  continued  our  work  in  high-speed  BiCMOS 
circuits.  We  have  successfully  used  Bisim  to  simulate  the  BiCMOS  sRAM,  as 
well  as  completed  testing  the  fabricated  device.  We  have  also  done  extensive 
testing  qn  the  BiCMOS  Translation  Look-Aside  Buffer  [23]  and  a  BiCMOS 
PLA.  / 
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Technical  Progress 

Parallel  Processor  Architecture 

An  architectural  overview  of  DASH  appears  in  a  paper  attached  to  this 
report. 

Hardware  Status 

During  the  past  six  months,  our  focus  has  been  on  completing  the  design  and 
beginning  the  construction  of  the  DASH  prototype.  This  includes  several  sub- 
efforts: 

(i)  To  be  able  to  use  the  Silicon  Graphics  4D/240  systems  as  the  nodes 
of  DASH,  a  nximber  of  modifications  are  necessary  to  the  processor, 
memory,  and  I/O  boards.  For  example,  we  need  to  modify  the 
processor  boards  and  the  I/O  board  (which  has  the  arbiter)  to  force 
the  processor  off  the  bus  when  it  makes  a  remote-memory  request 
and  eventually  be  able  to  respond  to  the  processor  when  data  has 
been  retrieved.  The  processor  boards  have  also  to  be  modified  to 
support  cache-to-cache  transfer  when  the  data  is  in  shared  state. 
The  memory  boards  need  to  be  modified  to  recognize  non-zero  base 
addresses  plus  a  host  of  other  changes.  We  have  finalized  the 
changes  that  are  necessary  after  detailed  discussions  with  the 
Silicon  Graphics  designers.  We  are  currently  making  these  changes 
to  the  boards  of  two  SGI  systems.  We  have  also  developed 
diagnostic  suites  to  verify  the  correctness  of  our  modifications. 

(ii)  The  directory  board  forms  the  heart  of  the  DASH  multiprocessor. 
There  is  one  such  board  on  each  SGI  4D/240  node.  The  board 
contains  (a)  directory  memory  (keeps  track  of  caching  state  of 
blocks),  (b)  the  directory  controller  (manages  the  directory  memory), 

(c)  pseudo-CPU  (services  requests  from  remote  CPUs),  (d)  reply 
controller  (handles  replies  to  this  cluster's  remote  requests),  (e) 
network  interface,  and  (f)  the  performance  monitor.  At  this  point 
the  logic  design  of  all  of  these  subcomponents  is  complete  and 
entered  into  our  VALID  CAE  tools.  We  are  currently  doing 
simulations  to  check  the  correctness  of  the  logic  and  to  do  timing 
verification.  Within  the  next  1-2  months  we  expect  to  send  the 
board  out  for  fabrication. 

(iii)  In  parallel  with  the  work  on  hardware  we  have  also  built  a  detailed 
software  simulator  of  the  DASH  architecture.  The  simulator  closely 
follows  the  hardware  of  SGI  4D/240  and  our  design  of  the  directory 
board.  We  are  using  this  simulator  to  debug  the  distributed  cache- 
coherence  protocol  used  in  DASH  using  pseudo-random  test 
generation  methods.  We  are  also  using  this  simulator  to  obtain  test 
vectors  for  simulation  of  directory  board  on  VALID  tools.  The 
simulator  reinforces  the  need  for  experimental  hardv/are,  since  it 
runs  10,000  times  slower  (per  processor)  than  DASH. 
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(iv)  Finally,  in  close  cooperation  with  Silicon  Graphics  Inc.,  we  are 
modif^ng  the  existing  operating  system  on  the  4D/240  to  work  on 
DASH.  This  includes  modifying  the  memory  management  code, 
interrupt  handlers,  TLB  consistency  mechanisms.  Much  work  is 
also  being  done  on  modifying  and  developing  new  low-level 
diagnostics  that  will  check  the  health  of  the  system  when  it  first 
comes  up  and  for  initially  configuring  the  hardware. 

Our  overall  goal  is  to  have  a  prototype  of  DASH  with  16  processors  to  be  up  in 
about  six  months  from  now.  This  will  involve  getting  all  the  hardware  put 
together  in  about  four  months,  leaving  two  months  for  initial  hardware  and 
software  debug. 

Basic  Architecture  Studies  and  Simulation  Tools  Effort 

In  addition  to  the  work  on  the  prototype,  we  have  been  working  on  novel 
techniques  to  reduce  the  invalidation  traffic  and  directory-memory 
requirements  for  large-scale  multiprocessors.  Recall  that  directory-based 
cache-coherence  schemes  rely  on  a  directory  to  keep  track  of  all  processors 
caching  a  memory  block.  When  a  write  to  that  block  occurs,  point-to-point 
invalidation  messages  are  sent  to  keep  the  caches  coherent.  A 
straightforward  way  of  recording  the  identities  of  processors  caching  a 
memory  block  is  to  use  a  bit  vector  per  memory  block,  with  one  bit  per 
processor.  Unfortunately,  when  the  main  memory  grows  linearly  with  the 
number  of  processors,  the  total  size  of  the  directory  memory  grows  as  the 
square  of  the  number  of  processors,  which  is  proliibitive  for  large  machines. 
To  remedy  this  problem  several  schemes  that  use  a  limited  number  of 
pointers  per  directory  entry  have  been  suggested  [1].  Unfortunately,  these 
schemes  often  cause  excessive  invalidation  traffic. 

We  have  proposed  two  methods  for  lowering  invalidation  traffic  and  directory 
memory  requirements.  The  first  is  the  coarse  vector  directory  scheme.  In  the 
most  common  case  of  a  block  being  shared  between  a  small  number  of 
processors,  the  directory  is  kept  in  the  form  of  several  pointers.  Each  points 
to  a  processor  which  has  a  cached  copy.  When  the  number  of  processors 
sharing  a  block  exceeds  the  number  of  pointers  available,  the  directory 
switches  to  a  different  representation.  The  same  memory  that  was  used  to 
store  the  pointers  is  now  treated  as  a  coarse  bit-vector,  where  each  bit  of  the 
state  indicates  a  group  of  processors.  We  term  this  new  directory  scheme 
DiriCVr,  where  i  is  the  number  of  pointers  and  r  is  the  size  of  the  region  that 
each  bit  in  the  coarse  vector  represents.  With  all  bits  set,  the  equivalent  of  a 
broadcast  is  achieved.  While  using  the  same  amount  of  memory,  the 
proposed  scheme  is  at  least  as  good  as  the  limited  pointer  scheme  with 
broadcast  -  presented  as  DiriB  in  [1]. 

The  second  method  reduces  directory  memory  requirements  by  organizing  the 
directory  as  a  cache,  instead  of  having  one  directory  entry  per  memory  block. 
Since  the  total  size  of  main  memory  in  machines  is  much  larger  than  that  of 
all  cache  memory,  at  any  given  time  most  memory  blocks  are  not  cached  by 
any  processor  and  the  corresponding  directory  entries  are  empty.  The  idea  of 
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a  directory  cache  that  only  contains  the  active  entries  is  thus  appealing. 
Furthermore,  there  is  no  need  to  have  a  backing  store  for  the  directory  cache. 
The  state  of  a  block  can  safely  be  discarded  after  invalidation  messages  have 
been  sent  to  all  processor  caches  with  a  copy  of  that  block.  Our  scheme  of 
directory  caches  brings  down  the  storage  requirements  of  main  memory 
based  directories  closer  to  that  of  cache-based  linked  list  directory  schemes 
such  as  the  SCI  scheme.  However,  we  avoid  the  disadvantages  of  cache- 
based  directories  -  having  to  follow  linked  lists  to  determine  the  processors 
caching  a  memory  block  and  having  to  build  the  directory  out  of  expensive 
SRAM  cache  memory. 

We  have  done  limited  comparisons  of  the  full  bit-vector  scheme  and  existing 
limited  pointer  schemes  with  our  coarse  vector  scheme.  We  have  also 
completed  preliminary  evaluations  of  cached  directories.  The  performance 
results  obtained  so  far  have  been  encouraging.  (We  used  the  data  and 
invalidation  traffic  generated  by  the  different  schemes  as  a  measure  of 
performance.)  Our  results  show  that  depending  on  the  application,  the  coarse 
vector  scheme  produces  up  to  65%  less  memory  message  traffic  than  the  next 
best  limited  pointer  scheme.  Caching  the  directory  adds  less  than  10%  to  the 
traffic  while  potentially  reducing  directory  memory  overhead  by  an  order  of 
magnitude. 

We  have  continued  the  development  and  use  of  our  software  tracing  and 
simulation  system.  Tango.  Tango  has  now  been  used  to  simulate  applications 
running  on  over  100  processors.  Performance  evaluations  have  confirmed 
that  Tango  gains  significant  efficiency  by  allowing  the  user  to  specify  both  the 
information  to  be  traced,  and  a  memory  system  model  with  an  appropriate 
level  of  abstraction.  Tango  has  been  extended  to  run  with  applications 
written  in  Fortran  as  well  as  C,  which  increases  the  set  of  applications 
available  to  us  for  study.  Tango  is  currently  being  used  to  support  a  variety 
of  investigations  in  algorithms  for  logic  simulation  and  linear  algebra,  shared 
data  locality,  program  synchronization  characteristics,  memory  system 
design,  and  the  DASH  architecture. 

Our  initial  DASH  studies  used  a  detailed  cycle-by-cycle  simulation  of  the 
memory  system,  and  we  found  that  most  of  the  time  in  Tango  was  spent 
simulating  the  memory  system.  This  discovery  prompted  efforts  to  consider 
faster  approaches  to  DASH  memory  system  simulation.  We  now  have  several 
alternative  simulators.  One  is  an  event-driven  simulator  that  is  2-5  times 
faster.  W^e  also  have  simulators  that  approximate  contention  and  cache 
behavior  in  DASH;  these  simulators  allow  the  user  to  trade-off  modest 
amounts  of  accuracy  for  simulations  that  are  orders  of  magnitude  faster,  and 
are  especially  valuable  when  studying  the  behavior  of  applications  rimning 
on  DASH. 

We  have  recently  begun  the  design  of  interactive  visualization  tools  that  will 
aid  users  in  understanding  simulation  results.  An  interface  to  Tango  will 
permit  displaying  both  general  information  about  application  behavior,  and 
also  information  that  is  specific  to  the  memory  system  used  in  the  simulation. 
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Parallel  Software 


Overview  of  the  Compiler  Development 

We  have  developed  a  solid  platform  for  research  in  scalar  and  parallel  code 
compilation  and  compiler/architecture  tradeoffs.  We  have  implemented  a 
basic  system  that  compiles  Fortran  code  into  MIPS  assembly  code. 
Integrating  the  information  used  in  scalar  and  parallel  code  optimizations 
into  one  representation,  this  system  is  supportive  of  research  into  a  wide 
range  of  topics.  Our  current  investigation  includes  the  following  key  compiler 
issues: 

•  A  scalar  data  flow  optimizer  generator 

•  Measurement  on  the  effectiveness  of  data  dependence  tests 

•  Measurement  of  parallelism  in  applications 

•  A  new  technique  to  analyze  the  data  flow  of  array  data 

•  Loop  level  transformations  for  parallelism  and  data  locality 

•  Superscalar  code  scheduling 

•  Data  placement  for  mxaltiprocessors 

Developing  a  compiler  system  for  uni-  and  multiprocessors  can  potentially 
entail  a  significant  laborious  implementation  effort.  Fortunately,  we  have 
been  able  to  develop  new  insights  and  algorithms  for  many  of  the  major 
components  of  the  system  and  increase  the  research  value  of  our 
implementation  effort.  For  example,  even  in  the  well  studied  domain  of 
scalar  code  optimizations,  we  have  developed  the  new  concept  of  a  data  flow 
optimizer  generator  and  we  are  in  the  process  of  implementing  the  algorithm. 
Similarly,  as  we  study  the  problem  of  data  locality,  we  have  developed  a 
unifying  approach  to  general  loop  transformations  and  greatly  simplified  the 
implementation  of  such  optimizations. 

Through  the  broad  coverage  of  this  research  project,  we  hope  to  develop  a 
complete,  fully  optimizing  compiler.  In  this  way,  we  can  evaluate  individual 
optimizations  properly,  within  the  context  of  all  other  optimizations.  More 
importantly,  this  approach  allows  us  to  evaluate  optimizing  compilers  as  a 
whole  and  identify  the  true  strengths  and  weaknesses  of  the  compiler 
technology  and  how  it  must  advance  to  better  support  parallel  processing.  A 
recent  paper  covering  some  of  this  work  is  attached  this  report. 

Integrating  Data  Flow  with  Data  Dependence 

The  integration  of  scalar  and  parallel  code  optimizations  in  one  system 
fosters  the  development  of  new  fundamental  compiler  techniques.  The 
current  state  of  the  art  is  that  data  flow  analysis  does  not  distinguish 
between  different  elements  of  an  array,  whereas  data  dependence  analysis 
disambiguates  array  references  but  is  insensitive  to  data  flow.  If  data  flow 
analysis  can  be  combined  with  data  dependence  information,  the  new 
analysis  data  can  give  rise  to  new  code  optimizations  such  as  allocating  array 
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references  in  registers  and  expanding  array  variables  to  generate  more 
parallelism  in  miStiprocessor  programs. 

Unlike  previous  approaches  that  augment  data  flow  techniques  to  provide 
more  precise  information  on  arrays,  we  are  investigating  the  technique  of 
extending  the  data  dependence  technique  to  account  for  data  flow. 
Stimulated  by  the  motivation  to  support  new  optimizations,  we  have 
developed  a  new  data  dependence  representation  that  appears  to  better 
support  even  the  existing  code  transformations. 

The  key  insight  is  to  characterize  each  array  reference  by  its  position  on  an 
absolute  scale.  Any  relationship  between  any  two  or  more  references  can  be 
derived  simply  and  directly  from  this  representation.  This  is  in  contrast  to 
the  existing  approach  where  the  dependence  relationships  between  evep^  pair 
of  accesses  are  explicitly  represented.  It  is  costly  to  generate  the  pair-wise 
information  in  the  first  place,  and  it  is  even  more  so  to  derive  information 
relating  more  than  two  cells  from  this  pair-wise  representation.  Moreover, 
while  it  is  trivial  to  update  our  dependence  information  after  each 
transformation,  existing  systems  tend  to  completely  re-evaluate  all 
dependences  after  each  code  optimization  phase. 

In  this  initial  study,  we  concentrated  only  on  array  references  that  sweep  the 
entire  array  space  exactly  once  in  the  entire  loop  nest.  As  the  results  we 
obtain  are  quite  promising,  we  plan  to  extend  this  model  to  more  general 
references,  and  to  formulate  the  legality  of  existing  loop  level  transformations 
using  this  new  representation. 

Optimizations  for  Data  Locality  and  Parallelism 

While  the  computation  bandwidth  of  processors  experiences  phenomenal 
increases  in  recent  years,  the  main  memory  subsystem  fails  to  keep  pace  and 
thus  the  gap  between  processor  and  memory  speeds  continues  to  widen.  It  is 
no  longer  feasible  to  ignore  the  memory  hierarchy;  compilers  must  improve 
the  cache  performance  by  increasing  the  locality  of  data.  We  have  developed 
and  implemented  a  compiler  optimization  algorithm  that  restructures 
numerical  kernels  to  maximize  data  locality.  Besides  finding  data  locality, 
our  algorithm  can  also  discover  both  fine  and  coarse  grain  parallelism,  the 
former  exploitable  on  a  superscalar  processor  and  the  latter  on  a 
multiprocessor.  Our  current  implementation  generates  runnable  code  for 
sequential  machines.  As  for  multiprocessor  code,  the  implementation  can 
already  identify  the  parallelism,  and  development  of  a  backend  to  generate 
runnable  multiprocessor  code  is  in  progn'ess.  Our  next  step  is  to  experiment 
with  the  implementation  using  benchmark  programs. 

In  focusing  on  this  new  problem  of  data  locality,  we  have  developed 
techniques  that  are  important  to  the  general  field  of  loop  optimizations. 
Numerous  loop  transformations  (e.g.  loop  interchange,  reversal,  skewing  and 
tiling)  have  been  proposed  and  shown  to  be  useful  for  vectorization  and 
concurrentization.  An  issue  critical  to  using  these  optimizations  in  practice  is 
how  to  decide  when  and  which  of  these  transformations  should  be  applied. 
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Each  individual  transformation  had  a  different  legality  test.  However, 
individual  transformations  may  not  all  directly  contribute  to  improving  a 
particular  goal,  but  may  nonetheless  be  important  for  making  other 
optimizations  legal.  This  has  led  to  the  proposed  “generate  and  test" 
technique  which  generates  and  evaluates  all  programs  obtained  through  all 
different  possible  permutations  of  transformations. 

Our  approach  is  to  unify  all  legality  tests  of  transformations  and  their  effects 
in  the  same  mathematical  model  of  iteration  space  and  dependence  vectors, 
so  we  can  define  the  legality  test  and  effect  of  compound  transformations. 
Under  this  model,  a  compound  transformation  on  sequential  loops  is  legal 
simply  if  the  dependence  vectors  after  transformation  remain 
lexicographically  positive.  By  expressing  the  goal  of  the  transformations  as  a 
desired  effect  on  the  code,  we  can  obtain  the  best  compound  transformation 
efficiently. 

Through  the  analysis  under  this  unified  framework,  we  show  that  tiling  is 
the  key  transformation  useful  for  generating  data  locality  and  parallelism. 
The  problem  of  data  locality  is  reduced  to  placing  the  maximum  number  of 
loops  identified  to  carry  locality  in  the  innermost  tile.  Using  this  goal  and  the 
legality  considerations  of  tiling,  we  can  significantly  prune  the  search  space 
to  find  the  best  compound  transformation.  The  conditions  that  made  tiles 
legal  in  the  first  place  guarantee  both  coarse  and  fine  grain  parallelism 
within  a  tiled  loop.  Thus,  we  have  an  extremely  simple  algorithm  that  can 
generate  programs  that  contain  both  parallelism  and  locality. 

This  unification  and  simplification  of  the  conceptual  model  is  particularly 
significant  when  it  comes  to  implementation.  In  the  previous  approach  where 
only  the  effects  of  individual  transformations  are  understood,  a  compound 
transformation  has  to  be  implemented  through  a  series  of  transformations. 
With  each  transformation,  the  code  (e.g.  the  loop  bounds)  becomes  more 
complex  due  to  the  effects  on  the  boundary  conditions.  However,  if  the 
compound  transformation  is  performed  on  the  original  simple  source 
program,  the  resulting  code  and  thus  the  implementation  are  much 
simplified. 

Improving  the  Cache  Performance  of  Explicitly  Parallelized  Code 
Another  component  of  our  research  in  cache  optimizations  focuses  on 
explicitly  parallelized  applications  for  multiprocessors  that  support  cache 
coherence  in  hardware.  In  this  case,  it  is  the  behavior  of  the  shared  data  that 
dominates  the  cache  performance,  which  in  turn  is  a  critical  factor  in 
determining  the  multiprocessor  performance.  We  have  performed  some 
exploratory  research  to  xxnderstand  and  quantify  the  significance  of  sharing 
in  application  benchmarks.  In  addition  to  investigating  the  effect  of  the 
compiler  and  its  role  in  improving  the  cache  performance,  we  also  study  the 
implication  of  sharing  on  the  cache  design. 

Most  studies  of  the  behavior  of  shared  data  have  been  based  on  unoptimized 
code.  By  measuring  code  compiled  with  an  optimizing  compiler,  we  found  a 
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much  higher  fraction  of  shared  references  in  the  data  stream  than  was 
previously  suggested.  Traditional  compiler  optimizations  optimize  away  non- 
shared  memory  references  and  thus  substantially  increase  the  significance  of 
shared  data  references.  The  fraction  of  shared  references  in  the  data  stream 
more  than  doubled  on  average,  reaching  values  from  33%  to  over  90%.  Since 
the  programs  run  father  after  optimization  while  the  number  of  shared 
references  remains  the  same,  the  shared  data  access  bottleneck  is  intensified. 
The  large  discrepancy  in  shared  reference  ratios  between  optimized  and 
unoptimized  code  suggests  that  performance  studies  of  multiprocessor 
programs  must  be  based  on  optimized  code  for  the  results  to  be  meaningful. 

The  cache  miss  rate  and  the  traffic  created  by  sharing  depend  on  two  effects: 
true  sharing,  which  is  intrinsic  to  a  reference  stream  of  a  parallel  program, 
and  false  sharing,  which  is  induced  by  the  collocation  of  different  data  items 
in  the  same  cache  block.  Measurements  on  parallel  programs  indicate  that 
false  sharing  misses  generally  increase  with  block  size,  while  cold  and  true 
sharing  misses  generally  decrease  with  block  size.  However,  the  true  sharing 
miss  rate  drops  more  slowly  than  might  be  expected,  leading  in  some  cases  to 
an  increase  in  the  overall  miss  rate  and  in  almost  all  cases  a  dramatic  growth 
in  the  amount  of  shared  data  traffic  vdth  increasing  block  sizes.  Most  of  this 
traffic  is  composed  of  words  that  will  not  be  referenced  by  a  given  processor 
between  two  consecutive  misses  on  the  block  by  the  same  processor.  This 
suggests  that  short  data  cache  blocks  should  be  used  to  maximize 
performance. 

The  analysis  suggests  two  approaches  to  improve  the  cache  behavior  of  the 
programs:  repositioning  the  shared  data  in  memory  at  the  block  level,  and 
changes  in  the  data  structure  organization  or  in  the  assignment  of 
computation  to  processors.  The  first  approach  is  appropriate  for  applications 
with  substantial  false  sharing  misses;  the  second  one  is  especially 
appropriate  for  programs  with  low  false  sharing  and  a  very  slow  decrease  in 
the  cold  and  true  sharing  misses  with  block  size  increases.  The  first 
approach  is  evaluated  and  demonstrated  to  have  a  significant  impact  on  some 
of  the  applications  studied.  We  expect  that  these  optimizations  will  be  even 
more  important  in  future  large-scale  multiprocessors.  Firstly,  higher 
communication  latencies  are  likely  to  be  involved,  increasing  the  penalty  of 
each  cache  miss.  Secondly,  with  larger  numbers  of  processors,  the  ratio  of 
misses  eliminated  over  program  execution  time  will  possibly  be  higher. 

Even  after  the  placement  optimizations,  we  observe  that  multiprocessor 
applications  may  have  significantly  high  miss  rates  than  the  uniprocessor 
miss  rates.  For  large  multiprocessors  to  be  effective  on  the  type  of  programs 
considered,  advances  need  to  be  made  in  reducing  the  miss  rates  by  better 
algorithm  design,  and  in  building  hardware  and  software  systems  that 
tolerate  the  high  cache  miss  rate  and  traffic  patterns  of  shared  data.  We  are 
currently  in  the  process  of  studying  the  feasibility  of  program-directed  cache 
prefetching. 
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Scheme 

This  research  focuses  on  automatic  program  transformations  for  efficiently 
implementing  scientific  and  symbolic  programs  on  serial  and  parallel 
machines.  After  implementing  several  toy  systems  to  flesh  out  our  ideas  and 
techniques,  we  are  ready  to  implement  a  fully  automatic  partial  evaluator 
that  handles  all  of  the  Scheme  programming  language.  We  have  shown  that 
partial  evaluation  exposes  tremendous  amounts  of  instruction  level 
parallelism.  This  is  an  important  feature  for  harnessing  the  power  of 
superscalar  and  superpipelined  architectures.  To  date  our  results  have  been 
obtained  using  a  programmer  directed  partial  evaluator.  We  feel  that  our 
techniques  and  results  will  become  fully  acceptable  only  when  they  become 
fully  automatic.  (A  separate  proposal  has  been  submitted  in  response  to 
BAA#90-03  to  fund  the  construction  of  this  partial  evaluator.) 

On  a  separate  topic,  Katz  and  Weise  have  invented  methods  for  making  the 
future  construct  of  MultiLisp  and  MultiScheme  interact  properly  with  the 
continuation  mechanism  of  Scheme.  Previous  implementations  exhibit 
pathological  behavior  when  futures  and  continuation  were  used  together. 


Uniprocessor  Architecture 

Boosting  Beyond  Static  Scheduling  in  a  Superscalar  Processor 
We  continue  to  search  for  opportunities  where  new  combinations  of  software 
and  hardware  functionality  can  propel  the  performance  level  of  processors  to 
a  new  height.  An  important  problem  that  demands  creative  cooperation 
between  software  and  hardware  is  the  parallelization  of  non-numerical  code. 

Software  techniques,  e.g.  software  pipelining,  have  previously  been  shown  to 
be  effective  in  exploiting  the  parallelism  in  statically-scheduled  superscalar 
or  VLIW  (Very  Long  Instruction  Word)  machines.  This  approach  relies  on 
the  numerous  “do-loop”  branches  that  can  be  evaluated  early  to  expose  the 
parallelism  across  many  iterations.  However,  non-numeric  code  has  more 
data  dependent  branches  that  cannot  be  resolved  early.  Hardware-scheduled 
superscalar  machines  can  parallelize  nonnumeric  code  better  by  speculatively 
executing  operations  beyond  a  branch  before  the  branch  decision  is  resolved. 
Unfortunately,  the  additional  hardware  necessary  to  look  far  ahead  in  the 
dynamic  instruction  stream,  find  independent  operations,  and  schedule  these 
independent  operations  out  of  order  is  costly  and  complex.  The  complexity 
may  lengthen  the  cycle  time  of  the  machine  and  reduce  the  actual  speedup 
over  the  scalar  processor. 

We  have  developed  an  architecture  that  combines  the  best  qualities  of  static 
and  dynamic  instruction  scheduling  to  increase  the  performance  of  non- 
numerical  applications  [20].  The  architecture  performs  all  instruction 
scheduling  statically  to  take  advantage  of  the  compiler's  ability  to  efficiently 
schedule  operations  across  many  basic  blocks  and  the  lower  hardware  cost. 
Since  the  conditional  branches  in  non-numerical  code  are  highly  data 
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dependent,  the  architecture  introduces  the  concept  of  boosted  instructions, 
instructions  that  are  committed  conditionally  upon  the  result  of  later  branch 
instructions.  Boosting  effectively  removes  control  dependence  as  a 
scheduling  constraint  and  makes  the  scheduling  of  side-effect  instructions  as 
simple  as  those  that  ai'e  side-effect  free.  For  efficiency,  boosting  is  supported 
in  the  hardware  by  shadow  structures  that  temporarily  hold  the  side  effects  of 
boosted  instructions  until  the  conditional  branches  that  the  boosted 
instructions  depend  upon  are  executed.  When  the  branch  condition  is 
determined,  the  buffered  side  effects  are  either  committed  or  squashed. 

Our  experiments  on  non-numerical  code  show  that  adding  a  single  level  of 
boosting  to  a  statically-scheduled  superscalar  processor  yields  a  1.6-time8 
speedup  over  scalar  code.  This  performance  is  comparable  to  the 
performance  of  an  aggressive,  dynamically-scheduled  superscalar  processor. 
Consequently,  we  are  working  on  a  compiler  and  a  simulator  for  the  full 
furictionality  of  boosting.  This  system  will  experiment  with  the  boosting  of 
instructions  above  multiple  conditional  branches,  the  reorganizing  of 
instructions  both  up  and  down  across  multiple  basic  block  boundaries,  and 
the  advantages  of  memory  disambiguation. 

Even  if  hardware  scheduled  superscalars  could  be  made  to  run  at  the  same 
clock  speed,  architectures  with  boosting  are  still  faster.  Sophisticated 
compilers  can  identify  the  operations  on  the  critical  path.  By  boosting  these 
operations  up  branches,  the  compiler  can  cause  these  operations  to  execute 
even  before  they  could  have  been  fetched  were  the  machine  a  hardware- 
scheduled  superscalar.  As  we  scale  up  the  degree  of  parallelism  in  the 
machine,  this  advantage  will  become  even  more  significant  due  to  a  higher 
ratio  of  branches  per  instruction. 


Computer  Aided  Design 

S3^thesis 

In  our  approach  the  hardware  behavior  is  first  abstracted  in  terms  of  a 
sequencing  graph  model.  The  problem  of  hardware  partitioning  is  then 
formulated  as  a  graph  partitioning  problem.  The  input  graph  model  may  be 
hierarchical  with  externally  imposed  resource  and  timing  constraints.  Area 
and  latency  estimates  for  hierarchical  graphs  are  computed  in  a  bottom-up 
manner.  Once  an  optimum  partition  is  determined,  a  set  of  behavior¬ 
preserving  transformations  are  applied  to  the  sequencing  graphs  in  order  to 
obtain  two  interacting  sequencing  graphs. 

We  have  applied  Kemighan-Lin  heuristics  to  hardware  partitioning  with  the 
main  objective  of  performing  area-directed  cuts,  such  that  a  cost  function  of 
overall  latency  and  intercommunication  cost  is  minimized.  In  addition 
constraints  are  placed  on  the  size  and  pin-out  of  the  individual  blocks. 
Presently  we  are  investigating:  ways  of  performing  incremental  scheduling 
during  the  partitioning  process,  strategies  to  handle  timing  constraints 


12 


between  sequencing  graphs,  partitioning  in  presence  of  unbounded  delay 
operations,  and  resource  sharing  between  partitioned  blocks. 

Formal  Verification  of  Concurrent  Hardware 

Multiprocessor  hardware  is  highly  concurrent,  by  its  nature.  Unfortunately, 
concurrent  hardware  designs  are  difficult  to  debug  by  the  usual  methods  of 
simulation  and  prototyping,  because  concurrent  systems  behave 
nondeterministically.  This  nondeterminism  leads  to  transient  and 
nonrepeatable  problems  that  are  difficult  to  localize  and  diagnose. 

One  potential  solution  to  this  problem  is  formal  verification  using  automated 
theorem-provers.  Formal  hardware  verification  has  been  improving  rapidly, 
to  the  point  where  simple  processors  have  been  verified.  However,  existing 
techniques  are  not  necessarily  effective  for  concurrent  systems.  For  example, 
architectural  features  that  depend  on  concurrent  interaction  have  been 
omitted  from  almost  all  formally  verified  processor  designs.  These  include 
caches,  pipelining,  interrupts,  and  page  mapping.  We  have  addressed  this 
deficiency  by  developing  techniques  for  using  existing  theorem-provers  for 
concurrent  designs.  Recently,  we  have  had  some  success  in  using  the  HOL 
(“Higher-Order  Logic”)  system  to  verify  a  simple  sequential  cache  and  a  more 
sophisticated  directory-based  multiprocessor  cache. 

In  our  verification  method,  the  implementation  (system  behavior)  and 
specification  (desired  behavior)  are  both  represented  as  state  graphs.  The 
graphs  can  have  an  infinite  number  of  states,  since  the  states  are  defined  by 
logical  formulas,  not  represented  explicitly,  To  prove  that  the  implementation 
meets  a  specification,  one  gives  a  logical  definition  of  a  correspondence 
between  the  states  of  the  two  graphs.  This  correspondence  is  called  a 
refinement  relation.  The  properties  of  a  refinement  relations  basically  ensure 
that  for  every  execution  path  through  the  implementation,  there  is  a  parallel 
path  through  the  specification  (indicating  that  the  implementation  path  is 
desirable).  The  theorem-prover  is  used  to  show  that  the  supposed  refinement 
relation  really  IS  a  refinement  relation. 

This  technique  is  sufficient  for  proving  safety  properties,  which  assert 
(intuitively)  that  “nothing  bad  happens”.  For  example,  a  processor  receiving 
an  inconsistent  value  from  the  cache  because  of  a  protocol  bug  would  be  a 
violation  of  a  safety  property.  Techniques  for  proving  liveness  properties  (e.g. 
that  every  read  request  is  eventually  satisfied)  have  yet  to  be  developed. 

The  proof  of  the  simple  cache  is  complete;  the  proof  of  the  multiprocessor 
cache  is  80%  complete.  Papers  describing  this  work  have  been  submitted  to 
the  International  Conference  on  Computer  Design  and  to  the  Workshop  on 
Computer-Aided  Verification  (to  be  held  in  Jime,  1990). 

Power  and  Ground  Noise  Analysis 

As  circuit  technologies  scale,  the  resistance  of  the  wires  on  the  circuit  become 
larger  relative  to  the  effective  resistance  of  the  scaled  transistors.  This 
problem  is  made  worse  by  the  even  increasing  die  size.  The  combination  of 
these  effects  makes  the  problem  of  iR  drop  in  the  power  distribution  wiring 
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an  increasingly  significant  problem  for  the  VLSI  designer,  and  a  problem  that 
is  likely  to  get  worse  as  technologies  improve.  To  better  understand  the 
problem,  and  to  provide  some  means  for  a  designer  to  check  the  supply  drops 
in  his/her  chip,  we  have  written  a  software  system  called  Ariel  that  can 
determine  the  supply  drops  in  a  VLSI  chip.  The  system  works  for  both 
CMOS  and  ECL  circuits,  and  has  been  used  to  check  a  number  of  large  chips. 

The  system  consists  of  three  main  components,  a  resistance  extractor  to 
create  a  resistor  network  model  of  the  power  supply  line,  a  switch  level 
simulator  that  captures  when  nodes  transition  and  the  shape  of  the  current 
pulse  that  is  generated,  and  a  linear  solver  for  sparse  equations  that  is  used 
to  generate  the  node  voltages  from  the  injected  currents  and  the  power 
supply  model.  Since  the  systems  we  are  interested  in  solving  have  large 
power  supply  nets  with  10-100  thousand  transistors  connected  to  them, 
efficiently  in  the  algorithms  is  extremely  important.  We  have  been  able  to 
run  the  complete  system  on  a  number  of  CMOS  CPU  chips. 

Our  recent  work  has  been  focused  on  checking  some  of  the  approximations 
Ariel  makes  to  increase  efficiency.  There  are  two  main  places  where 
approximations  are  used.  The  first  is  in  the  extraction  of  the  power-supply 
resistance  network.  Rather  than  using  an  exact  solution,  a  polygonal 
approximation  is  used,  which  greatly  reduces  both  the  number  of  nodes  in  the 
resulting  network,  and  the  time  required  to  generate  the  network.  We  have 
recently  finished  writing  an  accurate,  efficient  resistor  extractor  that  is  based 
on  the  work  done  by  Steve  McCormick  at  MIT,  but  uses  a  more  efficient 
solver  for  Laplace's  Equation.  Although  this  system  is  much  slower  than  the 
original  extractor,  it  does  allow  use  to  compare  the  results  of  the  two  systems. 
These  results  indicate  that  the  voltage  found  by  the  simpler  extractor  are 
within  20%  of  the  voltages  using  the  true  resistance  (worse  case  in  the 
examples  we  have  tried),  and  the  nominal  results  are  very  close. 

The  other  approximation  is  in  modeling  how  the  substrate  current  enters  the 
power-supply  network.  Since  the  currents  through  the  two  terminal  of  a 
capacitor  are  equal  the  current  from  the  bottom-plate  of  the  capacitor  is  equal 
to  the  current  that  flows  from  the  top-plate  through  the  transistor.  Ariel  uses 
a  simple  bounding  box  model,  it  takes  the  bounding  box  of  the  wire  being 
driven,  and  divides  the  substrate  current  evenly  between  all  the  substrate 
contacts  contain  in  the  bounding  box.  To  check  this  approximation  we  have 
changed  the  system  to  more  carefully  divide  up  the  substrate  under  the  wire 
and  distribute  the  current  more  accurately.  The  difference  in  the  calculated 
voltages  for  this  change  are  small,  on  the  order  of  5%  or  less. 

We  plan  to  continue  our  work  on  this  system  to  bring  it  to  the  point  where 
others  can  use  the  tool,  V/e  have  already  had  interest  from  a  number  of  chip 
design  companies  to  get  a  copy  of  the  system. 
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Simulation 

Parallel  Simulation 

CAD  applications  have  been  proposed  to  further  explore  the  potentials  of  the 
DASH  multiprocessor  machine  and  parallel  paradigms  in  general.  Research 
has  been  done  on  integration  of  existing  different  simulators  to  obtain  a 
parallel  multi-level  simulator.  Parallelism  is  obtained  at  each  simulator  level 
by  decomposing  its  simulation  into  smaller  blocks  and  managing  the 
communication  and  synchronization  of  these  blocks. 

Initially,  algorithms  for  distributed  and  parallel  simulation  were 
investigated.  The  framework  of  interest  was  a  distributed  discrete-event 
simulation  within  a  set  of  communicating  elements  that  exchange 
information  through  messages  and  distributed  time.  We  started  with  the 
well-known  time-stamped  algorithms  (Chandy  and  Misra)  to  reduce  deadlock 
occurrence,  increase  concurrency,  and  reduce  communication  traffic 
overhead.  The  concepts  of  intervals,  rather  than  time  stamps,  are  used  to 
represent  the  period  of  time  during  which  an  event  is  valid  in  a  simulation. 
With  their  use  there  are  situations  in  which  a  simulated  element  will  not 
block  while  it  would  have  otherwise  if  time-stamped  messages  were  used. 
With  the  use  of  this  kind  of  messages,  it  is  proved  that  acyclic  network 
simulations  are  deadlock  free.  The  use  of  labels  on  interval  messages  was 
introduced  to  allow  messages  to  detect  loops  and  obtain  event  scheduling 
optimizations.  An  interesting  feature  of  inter\'als  is  that  it  allows  simulation 
of  future  intervals  in  present  time.  These  results  were  presented  in  the  SCS 
Multiconference  on  Modeling  and  SimvJation  on  Microcomputers  [24]. 

A  prototype  to  test  and  explore  the  concept  of  parallel  nxixed-mode  simulation 
is  under  development.  The  simulators  being  integrated  consist  of  THOR,  a 
behavioral  simulator  for  use  with  digital  circuits  at  either  the  functional, 
register  transfer  or  gate  level,  IRSIM,  a  switch-level  simulator  for  MOS 
circuits,  and  SPICE,  a  general-purpose  circuit-level  simulator.  This 
prototype  is  like  a  printed  circuit  board  backplane  where  existing  simulator 
programs  are  plugged  in  to  rxm  in  parallel  either  at  the  same  design  level  or 
at  different  design  levels  to  function  as  a  harmonic  mixed  level  simulator. 
Basically  it  consist  of  a  kernel  to  handle  the  interface  to  simulators  and 
coordinate  their  computation.  It  uses  the  concepts  mentioned  above  of 
interval  messages,  and  labeled  messages  within  loops.  Also  it  handles 
conversion  between  design  levels  as  in  the  case  of  logic  signals  and  circuit 
signals. 

Mixed-mode  simulation  in  general  does  not  impose  implementation 
difficulties;  it  is  the  efficient  parallel,  or  distributed,  processing  of  such 
simulation  that  introduces  design  complexities.  First,  mixed-mode  simulation 
requires  the  integration  of  all  the  functionalities  that  exist  at  each  of  the 
simulation  levels.  Furthermore,  other  levels  or  different  scheduling 
algorithms  are  expected  to  appear  or  evolve  in  the  future.  Our  solution  to 
integrating  existing  programs  deals  vdth  this  particular  problem.  Second,  the 
decomposition  of  the  system  under  simulation  is  an  important  task  that  will 
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affect  performance.  The  third  issue  is  the  distributed  algorithm  used. 
Conventional  Chandy-Misra  algorithms  --  deadlock  avoidance,  deadlock 
recovery  -  suffer  from  performance  degradation.  Optimistic  algorithms  as 
used  in  the  Time-Warp  are  not  applicable  to  mixed-mode  simulation  because 
of  the  complexity  involved  in  simulation  roll-back.  The  use  of  time  intervals 
and  labeled  messages  on  the  actual  prototype  will  be  measured  and  analyzed 
regarding  their  impacts  on  performance.  Finally,  a  consistent  and  uniform 
user  interface  capable  of  handling  all  the  levels  will  be  provided. 

This  prototype  system  is  to  be  used  for  the  verification  of  an  adaptive  signal 
processing  system  at  both  the  chip  and  the  board  levels,  and  for  the  design  of 
a  complex  CMOS  Viterbi  detection  chip. 


VLSI  Design 

We  had  designed  a  64K  BiCMOS  sRAM  that  was  fabricated  in  TI's  0.8u 
BiCMOS  technology,  but  unfortunately  the  RAM  contained  a  small  number  of 
design  errors  that  made  testing  difficult.  We  were  able  to  get  around  most  of 
the  problems  by  changing  some  reference  levels  and  using  a  ion  milling 
system,  but  this  greatly  limited  the  number  of  parts  we  were  able  to  test.  The 
resulting  access  time  was  about  3.6ns,  but  the  parts  we  were  able  to  measure 
were  from  the  slow  split,  so  we  think  a  correct  version  of  this  RAM  would  be 
even  faster.  The  write  timing  was  also  measured,  and  the  minimum  write 
pulse  width  was  4ns.  Again  since  one  of  the  errors  in  the  design  involved  the 
write  reference  level,  we  feel  that  a  corrected  design  will  have  an  even  shorter 
write  pulse. 

Since  the  entire  design  was  completed  in  two  months,  we  are  now  looking  at 
the  design  in  more  detail,  trying  to  find  ways  to  either  improve  its 
performance  or  noise  immunity,  or  to  decrease  its  power  dissipation.  These 
changes  would  then  be  incorporated  into  a  revised  version  of  this  part. 

The  work  on  the  TLB  was  extremely  successful,  with  the  final  data  indicating 
a  complete  translation  time  of  3.6ns.  Removing  the  input  and  output  delay 
gives  an  internal  translation  time  of  under  3ns.  We  are  now  looking  at  ways 
to  improve  the  TLB  by  trying  to  reduce  the  600mW  power  it  requires.  It 
looks  possible  to  replace  some  of  the  static  pulldown  currents  with  clocked 
devices  without  changing  the  access  time.  This  basic  CAM/RAM  cell  can  be 
used  to  build  a  high-speed  programmable  logic  element,  and  we  are  starting 
to  look  into  this  application  a  little  more. 

During  this  period  we  have  also  been  working  on  Bisim,  our  BiCMOS 
simulator.  We  have  developed  a  set  of  piece- wise  linear  models  for  bipolar 
and  MOS  devices,  and  have  a  set  or  algorithms  for  finding  final  voltages  and 
timing  for  these  piece-wise  linear  elements.  The  final-value  code  looks  good. 
We  were  able  to  simulate  a  large  class  of  circuit  forms  using  it,  including  the 
BiCMOS  sRAM,  We  are  now  concentrating  on  the  timing  models  for  these 
circuits.  The  initial  model  that  we  developed  works  well  for  networks  with 
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“gfintle”  non-linearities,  but  has  difficulties  solving  networks  with  diodes, 
which  geiierate  sharp  changes  in  the  current  flow. 
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Abstract 

The  Stanford  DASH  multiprocessor  advances  the  state  of  parallel  computing  by  combining 
the  programmability  of  shared-memory  machines  with  the  scalability  of  message-passing 
machines.  The  key  idea  on  which  DASH  is  built  is  that  of  distributed  directory-based  cache 
coherence.  Shared  memory  in  the  machine  is  distributed  among  the  processing  nodes,  and 
processor  caches  are  kept  coherent  by  sending  point-to-point  messages  between  the  nodes  on 
an  interconnection  network.  The  network  provides  the  scalable  intcrprocessor 
communication  bandwidth  and  the  hardware-supponed  coherent  caches  provide  for  low 
latency  access  to  shared  data  and  ease  of  programming. 

The  prototype  DASH  machine  will  consist  of  64  high  performance  microprocessors,  with  an 
aggregate  performance  of  over  1200  MIPS  and  250  scalar  MFLOPS.  This  paper  discusses 
the  mechanisms  for  providing  scalable  memory  bandwidth,  reducing  and  tolerating  memory 
latency,  and  supporting  efficient  synchronization.  The  paper  also  describes  the  preliminary 
design  of  the  machine,  discusses  various  implementation  issues,  and  contrasts  our  approach 
to  other  proposals  with  similar  goals. 


Key  Words  and  Phrases:  scalable  shared-memory  multiprocessors,  distributed 
directory-based  cache  coherence,  DASH,  memory  latency,  memory  consistency, 
synchronization. 
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Abstract 

The  Stanford  DASH  multiprocessor  advances  the  state  of  parallel  computing  by  combining  the 
programmability  of  shared-tticmory  machines  with  the  of  message-passing  machines. 

The  key  idea  on  which  DASH  is  built  is  that  of  t.'ied  diiecP'iy-based  cache  coherence. 
Shared  memory  in  the  machine  is  distributed  among  Q.i.  piocessing  r'xles,  and  processor  caches 
are  kept  coherent  by  sending  point-to-point  messages  between  the  nodes  on  an  interconnection 
netwo±.  The  network  provides  the  scalable  inteiprocessor  communication  bandwidth  and  the 
hardware-supported  coherent  caches  proxide  for  low  latency  access  to  shared  data  and  ease  of 
programming. 

The  prototype  DASH  machine  will  consist  of  64  high  perfoimance  m  croprocessors,  with  an 
aggregate  performance  of  over  1200  MIPS  and  250  scalar  MFLOPS.  Tnis  paper  discusses  the 
mechanisms  for  providing  scalable  memory  bandwidth,  reducing  and  tolerating  memory  latency, 
and  supporting  efficient  synchronization.  The  paper  also  describes  the  preliminary  design  of  the 
machine,  discusses  various  implementation  issues,  and  contrasts  our  approach  to  othe^  {.  /Ojtosals 
with  similar  goals. 


1  Introduction 

We  arc  airrcntly  building  a  shared-memory  multiprocessor,  called  DASH  (Directory  Architecture 
for  SHared -memory),  at  Stanford’s  Computer  Systems  Laboratory.  The  fundamental  premise  of  this 
research  is  that  it  is  possible  to  build  a  scalable  shared-memory  machine  by  using  a  distributed 
directory-based  cache  coherence  protocol.  The  shared  memory  in  our  machine  is  partitioned  among 
the  processing  nodes,  which  are  connected  by  a  scalable  interconnection  network.  A  distributed 
directory  associated  with  main  memory  stores  the  identities  of  all  processors  caching  each  memory 
block.  This  allows  processor  caches  to  be  kept  coherent  by  sending  pomt-to-point  messages.  The 
DASH  architecture  combines  the  advantages  of  several  existing  and  proposed  architectures  into  a  single 
system.  These  advantages  include  suppon  for  shared  memory  to  make  programming  easy,  coherent 
caches  to  reduce  memory  latency  and  make  the  system  efficient,  and  a  directory-based  coherence 
protocol  to  provide  scalability. 

DASH  is  designed  to  be  a  general  purpose  machine.  Our  goal  is  so  s  apport  a  wide  variety  of  appli¬ 
cations  in  engineering  and  science — not  be  restricted  to  a  few  higliiy  structured  parallel  applications. 
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Therefore,  ii  is  important  to  support  a  shared-memory  model,  where  processors  can  directly  access 
all  memory  in  the  system.  Direct  access  eases  the  problem  of  data  partitioning  and  dynamic  load 
distribution.  More  importantly,  shared  memory  makes  it  easier  to  implement  parallelizing  compilers, 
which  we  believe  are  essential  for  parallel  processing  to  gain  widespread  usage. 

Current  shared-memory  machines  have  one  of  two  drawbacks:  either  they  are  not  scalable,  or  if 
they  scale  well,  they  do  not  support  coherent  caches,  which  significantly  reduces  the  performance  of 
the  individual  processors. 

Examples  of  non-scalable  multiprocessors  are  the  Encore  Multimax  and  Sequent  Symmetry.  The 
primary  rcason  for  their  non-scalability  is  their  bus-based  interconnect.  While  the  bus  provides  a 
fixed  bandwidth  interconnect,  the  bandwidth  requirements  of  these  machines  increase  linearly  with 
the  number  of  processors.  There  are  two  reasons  for  this.  First,  the  total  bandwidth  needed  to 
service  regular  cache  misses  increases  linearly  with  the  number  of  processors.  Second,  the  bandwidth 
required  for  invalidation  and  update  messages  to  keep  the  caches  coherent  increases  linearly,  since  the 
fraction  of  memory  references  per  processor  that  cause  coherence  traffic  is  expected  to  stay  consumi 
as  processors  are  added.  On  the  positive  side,  bus-based  machines  are  easy  to  design,  and  arc  effective 
when  only  a  small  number  of  processors  is  needed. 

The  BBN  Butterfly  [20]  and  IBM  RP3  [18]  multiprocessors  are  examples  of  scalable  machines. 
Both  these  marhin#**  distribute  the  shared  memory  among  the  processors  and  use  a  scalable  inter¬ 
connection  network.  However,  existing  machines  in  this  class  either  do  not  provide  caches  (as  in 
Butterfly),  or  do  not  provide  for  hardware  cache  coherence  (as  in  RP3).  In  the  latter  case  they  either 
don’t  cache  shared  writable  data,  or  rely  on  compiler  or  programmer  directed  cache  management. 
Unfortunately,  compiler  technology  is  not  advanced  enough  to  do  a  good  job  of  deciding  what  to 
cache  and  for  how  long.  For  some  classes  of  applications  and  languages  (e.g.,  C-based  programs), 
it  is  unclear  that  compiler  technology  will  ever  be  effective.  When  high  performance  processors  are 
used,  the  performance  of  this  clas'-  of  machines  is  primarily  limited  by  the  shared-memory  accesses 
that  have  to  go  over  the  network. 

In  contrast  to  these  machines.  DASH  provides  for  both  scalability  and  hardware  coherent  caches. 
The  high-level  organization  of  the  machine  is  shown  in  Figure  1 .  The  architecture  consists  of  a  number 
of  processing  nodes  connectetl  through  a  low-latency  interconnection  network.  Physical  memory  is 
distributed  among  the  nodes.  Each  processing  node,  or  cluster,  consists  of  a  small  number  of  high- 
performance  processors  with  their  individual  caches  and  a  portion  of  the  shared  memory  connected  by 
a  snooping  bus.  The  cluster  interfaces  to  the  interconnection  network  through  the  directory  controller. 
The  directory  controller  includes  directory  memory  that  tracks  the  remote  caching  state  of  the  memory 
within  the  cluster,  and  the  remote  access  cache  which  tracks  the  outstanding  requests  made  by  the 
processors  within  the  cluster.  The  remote  access  cache  also  provides  a  buffer  for  incoming  network 
replies.  Its  snoopy  cache  structure  allows  it  to  merge  outstanding  requests  from  different  processors 
and  supplement  the  functionality  of  the  processor  caches.  The  snoopy  scheme  keeps  caches  coherent 
within  a  cluster  while  inter-cluster  coherence  is  maintained  using  a  distributed  directory-based  protocol. 

The  concept  of  directory -based  cache  coherence  was  first  proposed  by  Tang  [21],  However,  we 
have  had  to  extend  it  to  work  with  distributed  shared-memory  [2,  16].  Each  processing  node  in 
DASH  has  a  directory  memory  corresponding  to  its  portion  of  the  shared  physical  memory.  For 
each  memory  line,  the  directory  stores  the  identities  of  all  remote  nodes  caching  that  line.  Using  the 
directory  memory,  a  node  writing  a  location  can  send  point-to-point  invalidation  or  update  messages 
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Figure  1;  The  DASH  architecture  coosisu  of  a  set  of  processing  nodes  connected  by  a  general 
interconnection  netwoiic.  Directory  memory  contains  pointers  for  each  memory  line  to  the  nodes 
currently  caching  that  line.  The  remote  access  cache  mainiairx!  state  on  the  on-going  remote  requests. 
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to  those  nodes  that  are  actually  caching  that  line.  This  is  in  contrast  to  the  invalidating  broadcast 
required  by  the  snoopy  protocol.  The  scalability  of  DASH  depends  on  this  ability  to  avoid  broadcasts. 
Another  important  attribute  of  the  directory-based  protocol  is  that  it  does  not  depend  on  any  specific 
interconnection  network  topology.  As  a  result,  we  can  readily  use  any  of  the  low-latency  scalable 
networks,  such  as  meshes  or  hypercubes,  that  were  originally  developed  for  distributed  memory 
message-passing  machines  [6]. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  describes  the  fundamental  concepts 
of  distributed  directory  based  architectures.  Section  3  describes  the  DASH  architecture  and  then 
discusses  how  our  design  addresses  some  of  the  important  problems  faced  by  scalable  shared-memory 
multiprocessors.  Section  4  discusses  implementation  issues.  Section  5  contrasts  our  approach  with 
other  recent  proposals  for  scalable  shared-memory  multiprocessors.  Finally,  we  present  conclusions 
and  summarize  the  cunent  status  of  the  prototype  development. 

2  Distributed  Directory-Based  Coherence 

The  key  to  scalability  of  any  multiprocessor  lies  in  the  ability  to  scale  the  memory  bandwidth  available 
to  each  processor.  DASH  provides  scalable  bandwidth  for  private  and  shared  data  in  local  memory 
by  distributing  the  physical  memory  among  the  processing  nodes.  To  achieve  the  best  performance, 
the  operating  system  should  allocate  the  local  data  of  a  process  in  the  memory  local  to  the  processing 
node.  For  shared  data  accesses  that  must  be  serviced  remotely,  DASH  uses  a  scalable  interconnection 
network.  The  amount  of  global  traffic  on  the  network  is  reduced  by  providing  coherent  caches  for 
shared  data.  Although  the  network  bandwidth  still  needs  to  scale  linearly,  caching  shared  data  reduces 
access  latency,  and  allows  a  better  match  between  local  processing  speed  and  the  achievable  bandwidth 
of  the  global  interconnect 

While  coherent  caches  are  valuable,  they  generate  coherence  traffic  and  require  extra  hardware 
support  Focusing  first  on  the  coherence  traffic  generated  by  the  directory-based  protocol,  we  do  not 
expect  it  to  cause  scalability  problems.  When  a  processor  modifies  a  shared  data  location,  the  directory- 
based  protocol  sends  point-to-point  invalidation  messages  to  all  other  processing  nodes  caching  that 
location.  Since  invalidations  need  only  be  sent  to  those  remote  caches  that  had  previously  fetched  the 
data,  the  invalidation  traffic  will  be  within  some  small  constant  factor  of  the  regular  data  traffic.  Since 
DASH  provides  scalable  interconnection  netwotk  bandwidth  and  scalable  bandwidth  to  the  directory 
memory  (the  directory  memory  and  controllers  are  themselves  distributed  among  the  processing  nodes), 
the  invalidation  traffic  does  not  cause  a  scalability  problem.  In  fact,  due  to  the  locality  in  shared  data 
references  [1],  we  expect  each  data  item  to  be  referenced  several  times  before  being  invalidated,  thus 
resulting  in  a  significant  reduction  in  global  traffic. 

The  major  scalability  concern  then  becomes  the  amount  of  directory  memory  needed  to  keep  track 
of  the  processing  nodes  that  cache  any  given  memory  line.  If  the  physical  memory  in  tht  machine 
grows  linearly  with  the  number  of  processing  nodes,  then  using  a  bit-vector  to  keep  track  of  all  nodes 
caching  a  memory  line  does  not  scale  well.  The  total  amount  of  directory  memory  needed  is  x  A'/I 
Mbits,  where  P  is  the  number  of  processing  nodes,  K  is  the  Mbytes  of  memory  per  processing  node, 
and  L  is  the  cache-line  size  in  bytes.  Thus,  the  memory  overhead  of  the  simple  directory  scheme 
is  P/A,  which  is  proportional  to  the  number  of  processing  nodes  and  inversely  proportional  to  the 
cache  line  size.  Depending  on  the  size  of  the  machine,  this  growth  with  P  may  be  easily  tolerable  or 
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Figure  2:  Cache  invalidation  patterns  for  MP3D  and  PTHOR.  MP3D  uses  a  particle-based  simulation 
technique  to  determine  the  structure  of  shock  waves  caused  by  objects  flying  at  high-speed  in  the 
upper  atmosphere.  PTHOR  is  a  parallel  logic  simulator  based  on  the  Chandy-Misra  algorithm. 

can  be  remedied  by  keeping  a  lesser  amount  of  information  per  memory  line.  For  example,  consider 
a  machine  with  32  processing  nodes,  each  processing  node  with  8  processors,  and  assume  that  the 
cache  line  size  is  32  bytes.  For  this  256  processor  machine,  the  overhead  of  the  directory  memory  is 
only  \2.S%\  comparable  with  the  overhead  usually  tolerated  to  support  an  error  correcting  code  on 
memory. 

Our  study  of  parallel  applications  has  shown  that  in  most  situations,  a  memory  location  is  cached 
by  only  a  small  number  of  processing  nodes  when  written.  Figure  2  shows  the  cache  invalidation 
patterns  as  observed  for  the  MP3D  and  PTHOR  applications  when  nmning  on  a  32  processor  machine. 
The  graphs  show  that  most  writes  cause  invalidations  to  only  a  few  caches  [22].  Consequently,  it  is 
possible  to  replace  the  complete  bit-vector  in  the  directory  by  a  small  number  of  pointers,  and  to  use 
broadcast  invalidations  in  the  unusual  case  when  the  number  of  pointers  provided  is  too  small  [2]. 
We  are  currently  investigating  the  performance  of  other  possible  schemes  that  limit  the  number  of 
clusters  that  can  cache  a  given  line  or  use  explicit  overflow  tables  when  the  number  of  shared  copies 
is  greater  than  the  number  of  pointers. 

The  above  discussion  assumes  that  the  accesses  are  uniformly  distributed  across  the  machine.  The 
uniform  access  assumption  does  not  hold  for  highly  contended  synchronization  objects[22].  DASH 
provides  special  extensions  to  the  directory -based  protocol  to  handle  synchronization  references,  as 
discussed  in  Section  3.4. 

The  uniform  access  assumption  also  does  not  hold  for  references  to  heavily  used  shared  data 
objects,  for  example,  references  to  the  variable  bolding  the  best  solution  in  a  branch-and-bound 
algorithm.  This  concentrated  accessing  of  data  from  the  memory  of  a  single  processing  node  over 
some  short  duration  of  time  constitutes  a  hot  spot.  Under  hot  spot  conditions,  the  memory  and  network 
bandwidth  are  reduced  since  the  distribution  of  resources  is  not  exploited  as  it  is  under  uniform  access 
patterns.  DASH  allows  shared  data  to  be  cached  and  kept  coherent,  thus  avoiding  hot  spots  that  may 
occur  in  other  parallel  machines  that  do  not  permit  caching  of  writable  shared  data.  In  the  example 
of  the  branch-and-bound  algorithm,  the  bound  is  not  updated  very  often  in  later  stages  of  the  search, 
but  it  is  accessed  for  each  move.  In  DASH,  the  bound  will  be  cached  in  the  processing  nodes  and 
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there  will  be  only  limited  hot-spotting  when  the  bound  is  changed.  In  other  architectures,  the  current 
bound  will  be  a  perpetual  hot  spot,  unless  the  algorithm  is  restructured. 

The  issue  of  memory  access  latency  becomes  more  prominent  as  an  architecture  is  scaled  to  a 
larger  number  of  nodes.  There  are  two  complementary  approaches  for  managing  latency:  methods 
that  reduce  latency,  and  mechanisms  that  help  tolerate  large  latencies.  To  provide  a  general-purpose 
architecture,  DASH  uses  both  of  these  approaches.  Om  primary  focus  has  been  to  reduce  the  latency 
for  accesses  before  trying  to  tolerate  the  latency.  Techniques  that  depend  on  tolerating  latency  demand 
a  large  degree  of  parallelism  to  overlap  communication  and  computation.  For  example,  if  the  latency 
is  /  then  /  x  P  degrees  of  parallelism  is  needed  to  fully  mask  the  latency.  Since  not  all  applications 
exhibit  such  large  parallelism,  we  emphasize  latency  reduction  techniques  first. 

Towards  the  goal  of  reducing  latency  for  remote  accesses,  the  hardware  cache-coherence  protocol 
permits  processon  to  cache  writable  shared  data.  As  will  be  discussed  in  section  3.1,  the  directory- 
based  coherence  protocol  has  been  specially  optimized  for  reducing  latency.  We  expect  the  caching 
of  shared  data  to  significantly  reduce  the  average  latency  for  remote  accesses  because  of  the  spatial 
and  temporal  locality  in  program  accesses  [1].  Furthermore,  the  distribution  of  shared  memory  among 
the  processing  nodes  allows  the  programmer  or  ccxnpiler  to  exploit  physical  locality  by  placing  data 
close  to  the  node  that  will  access  it  most 


3  The  DASH  Architecture 

The  DASH  architecture  has  a  two-level  structure.  At  the  top  level,  the  architecture  consists  of  a 
set  of  processing  nodes  (clusters)  connected  through  a  mesh  interconnection  network.  In  turn,  each 
processing  node  is  a  bus-based  multiprocessor.  Intra-cluster  cache  coherence  is  implemented  using  a 
snoopy  bus-based  protocol,  while  inter-cluster  coherence  is  maintained  through  a  distributed  directory- 
based  protocol. 

The  cluster  functions  as  a  high-performance  processing  node.  A  bus-based  cache  protocol  is 
chosen  for  implementing  small-scale  shared-memory  multiprocessors  because  the  bus  bandwidth  is 
sufficient  to  support  a  small  number  of  processors.  The  bus-based  design  allows  for  fine-grain  sharing 
within  each  cluster,  and  the  directory-based  scheme  provides  scalability  at  the  inter-cluster  level.  The 
grouping  of  multiple  processors  on  a  bus  within  each  cluster  also  enables  the  sharing  of  resources 
among  these  processors.  Furthermore,  this  grouping  reduces  the  directory  memory  requirements  by 
keeping  track  of  cached  lines  at  a  cluster  as  opposed  to  processor  level. 

3.1  The  Coherence  Protocol  and  Memory  Hierarchy 

The  directory-based  protocol  implements  an  invalidation-based  coherence  protocol.  A  memory  loca¬ 
tion  may  be  in  one  of  three  states.  It  may  be  uncached,  that  is  not  cached  by  any  processing  node  at 
all:  shared,  that  is  in  an  unmodified  state  in  the  caches  of  one  or  more  nodes;  or  dirty,  that  is  modified 
in  a  single  cache  of  some  node.  The  directory  keeps  the  summary  information  for  each  memory  line, 
specifying  the  clusters  that  are  caching  it 

An  ideal  system  would  have  uniform,  low  latency  access  to  all  memory  locations.  However,  this  is 
impossible  in  any  large-scale  system.  Memory  hierarchy  is  a  well-known  technique  to  reduce  memory 
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Figure  3;  Memory  hierarchy  of  DASH, 
latency  by  taking  advantage  of  physical  locality. 

The  DASH  memory  system  can  be  logically  broken  into  four  levels  of  hierarchy  as  Ulustrated  by 
Figure  3.  The  level  closest  to  the  processor  is  the  processor  cache  and  is  designed  to  match  the  speed 
of  the  processor.  A  request  that  cannot  be  serviced  by  the  processor  cache  is  sent  to  the  second  level 
in  the  hierarchy,  the  local  cluster  level.  This  level  consists  of  other  processors'  caches  and  the  remote 
access  cache  within  the  requesting  processor’s  cluster.  If  the  data  is  locally  cached,  the  request  can 
be  serviced  within  the  cluster,  otherwise  the  request  is  sent  to  the  directory  home  level.  The  home 
level  consists  of  the  cluster  that  contains  the  directory  and  physical  memory  for  a  given  memory 
address.  For  some  addresses,  the  local  and  home  cluster  are  the  same  and  the  second  and  third  level 
access  occur  simultaneously.  In  general,  however,  the  request  will  travel  through  the  interconnect  to 
the  home  cluster.  The  home  cluster  can  usually  satisfy  a  request,  but  when  the  directory  is  in  the 
dirty  state,  or  in  the  shared  state  when  the  requesting  processor  requires  exclusive  access,  the  fourth, 
remote  cluster  level,  must  be  accessed.  The  remote  cluster  level  responds  directly  to  the  local  cluster 
level  while  also  updating  the  directory  level. 

To  illustrate  the  directory  protocol,  we  first  consider  bow  a  processor  read  traverses  the  memory 
hierarchy: 

Processor  Level:  If  the  location  is  present  m  the  processor's  cache,  the  cache  simply  supplies  the 
data.  Otherwise,  the  request  goes  to  the  local  cluster  level. 

Local  Cluster  Level:  In  the  event  that  the  data  resides  in  the  caches  within  the  cluster,  the  data  is 
supplied  to  the  requestor  and  no  state  change  is  required  at  the  directory  level.  If  the  data  is 
dirty  in  one  of  the  processor  caches,  the  remote  access  cache  takes  ownership  of  the  location 
and  supplies  the  requestor  with  the  data.  If  the  request  must  be  sent  beyond  the  local  cluster 
level,  then  the  remote  access  cache  records  the  presence  of  the  outstanding  request  The  remote 
access  cache  can  then  merge  accesses  to  the  same  block  noade  by  other  processors  in  the  cluster. 


This  merging  reduces  hot-spotting  by  filtering  out  all  but  the  first  request  to  a  given  block.  If 
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the  local  cluster  fails  to  satisfy  the  request,  the  request  is  sent  to  the  home  cluster. 

Home  Directory  Level:  The  home  cluster  examines  the  state  of  the  memory  location  and  can  satisfy 
the  request  unless  the  location  is  dirty  in  a  remote  cluster.  If  not  dirty,  the  data  is  sent  to  the 
requestor  and  the  directory  is  updated  to  show  sharing  by  the  requestor.  If  the  location  is  dirty 
remote,  the  request  is  forwarded  to  the  remote  cluster. 

Remote  Level:  The  remote  cluster  sends  the  data  to  the  requestor,  and  writes  back  the  data  to  the 
home  cluster,  which  in  turn  updates  main  memory  and  the  directory  state  appropriately.  The 
remote  cluster  sends  the  data  directly  to  the  requestor,  as  opposed  to  routing  it  through  the 
home,  in  order  to  reduce  the  latency  involved  in  accessing  remote  dirty  locations. 

We  now  consider  the  sequence  of  operations  that  occurs  when  a  location  is  written. 

Processor  Level:  If  the  location  is  dirty  in  the  writing  processor’s  cache,  the  write  can  complete 
inunediately.  Otherwise,  a  read-exclusive  request  is  issued  on  the  local  cluster  to  obtain  exclusive 
ownership  of  the  line  and  retrieve  the  remaining  porbon  of  the  cache  line. 

Local  Cluster  Level:  If  the  cache  line  is  already  dirty  in  one  of  the  caches  within  the  cluster,  then  the 
read-exclusive  request  can  be  serviced  at  the  local  cluster  level  by  a  cache  to  cache  transfer.  In 
this  case,  processors  within  a  cluster  can  alternately  modify  the  same  memory  location  without 
any  intercluster  interaction.  Otherwise,  the  ownership  request  needs  to  be  sent  to  the  home 
cluster. 

Home  Directory  Level:  The  home  clusta  can  immediately  satisfy  the  ownership  request  if  the  mem¬ 
ory  location  is  not  cached  by  any  remote  clusters.  If  the  cache  line  is  in  shared  state,  all  copies 
of  the  data  must  be  invalidated.  The  contents  of  the  directory  memory  indicate  the  processing 
nodes  that  need  to  be  invalidated.  Invalidation  requests  are  sent  to  all  clusters  sharing  the  cache 
line  except  the  requesting  cluster.  Meanwhile,  the  home  cluster  sends  exclusive  data  to  the 
local  cluster  which  releases  the  requesting  processor.  When  the  processor  re-issues  the  request 
it  obtains  the  line  and  removes  any  shared  copies  from  the  other  local  processors’  caches. 

Remote  Level:  If  the  cache  line  is  in  dirty  state  in  some  remote  cluster,  the  write  request  is  forwarded 
to  the  remote  cluster.  The  cluster  sends  the  data  to  the  requesting  cluster  and  a  directory  update 
message  to  the  home  directory.  The  home  sends  an  acknowledgment  to  the  requesting  cluster 
when  the  update  is  complete.  If  the  home  bad  indicated  a  shared  state,  then  the  remote  clusters 
simply  invalidate  the  cache  line  when  they  receive  the  invalidation  request,  and  then  send  an 
acknowledgement  to  the  requesting  cluster. 

When  the  writing  processor  receives  all  the  invalidation  acknowledgments,  or  the  acknowledgment 
from  the  home  directory  that  the  state  has  been  changed,  it  is  guaranteed  that  the  directory  state  is 
current  and  all  copies  of  the  old  data  have  been  purged  from  the  system.  If  the  processor  delays 
completing  the  write  until  all  acknowledgments  are  back,  then  the  new  value  will  become  available  to 
all  processors  at  the  same  time.  However,  invalidations  may  involve  round  trip  messages  to  muJtipie 
caches  through  the  network,  resulting  in  a  long  wait  for  the  acknowledgment  messages.  We  can 
achieve  higher  processor  utilization  if  we  allow  the  write  to  proceed  immediately  after  the  ownership 
request  is  satisfied.  However,  this  can  cause  logical  problems  in  the  computational  model  for  the 
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multiprocessor.  The  next  subsection  describes  how  DASH  provides  minimum  constraints  on  memory 
request  ordering,  while  still  providing  a  reasonable  programming  model  to  the  user. 

3^  Memory  Consistency 

The  level  of  memory  consistency  required  by  an  architecture  directly  affects  the  amount  of  buffering 
and  pipelining  that  can  take  place  among  memory  requests.  In  addition,  it  has  a  direct  effect  on  the 
complexity  of  the  programming  model  presented  to  the  user.  Our  goal  in  DASH  is  to  provide  ntaximum 
freedom  in  the  ordering  among  memory  requests,  while  still  providing  a  reasonable  programming 
model  to  the  user. 

We  begin  the  discussion  on  consistency  with  some  definitions.  A  write  operation  is  considered 
globally  performed  when  a  read  by  any  processor  will  return  the  value  of  that  write  or  some  write  that 
logically  follows  that  write  [7].  .A  read  operation  is  globally  performed  when  the  result  of  the  read 
is  bound  to  the  value  of  some  write  that  has  been  globally  performed  (including  a  write  by  another 
processor). 

At  one  end  of  the  consistency  spectrum  is  the  sequential  consistency  model  [15],  which  requires 
the  execution  of  the  parallel  program  to  appear  as  some  interleaving  of  the  execution  of  the  parallel 
processes  on  a  sequential  machine.  Strong  ordering  [7]  provides  sufficient  requirements  for  achieving 
sequential  consistency.  Sufficient  conditions  for  strong  ordering  require  a  memory  request  to  be 
globally  performed  before  the  next  request  by  the  same  processor  can  be  issued. 

Strong  ordering,  while  conceptually  appealing,  imposes  a  large  performance  penalty  on  writes.  It 
prohibits  the  processor  from  completing  a  write  until  all  related  coherence  traffic  has  been  propagated 
throughout  the  system.  For  most  applications,  such  a  model  is  too  strict  and  one  can  make  do  with  a 
much  weaker  notion  of  consistency. 

A  weaker  consistency  model  can  be  derived  by  relating  memory  request  ordering  to  synchronization 
points  in  the  program.  As  an  example,  consider  the  case  of  a  processor  up>dating  a  data  structure  within 
«  critical  section.  If  updating  the  data  structure  requires  several  writes,  in  the  strongly  ordered  system, 
each  write  will  stall  until  all  other  cached  copies  of  that  location  have  been  invalidated.  But  these 
stalls  are  unnecessary  as  the  programmer  has  already  made  sure  that  no  other  process  can  rely  on  that 
data  strtictuie  being  consistent  until  the  critical  section  is  exited.  If  we  can  identify  the  synchronization 
points,  we  need  only  ensure  that  the  memory  is  consistent  at  those  points.  Therefore,  the  consistency 
can  be  relaxed  by  only  requiring  the  operations  to  be  globally  performed  before  a  release  operation  is 
observed.  We  call  this  model  of  consistency  release  consistency  [10], 

Such  a  scheme  has  two  advantages.  First,  it  provides  the  user  with  a  reasonable  programming 
model,  since  the  programmer  is  ensured  that  when  the  critical  section  is  exited,  all  other  processors  will 
have  a  consistent  view  of  the  modified  data  structure.  Second,  it  permits  the  invalidations  of  different 
write  operations  to  overlap,  resulting  in  lower  latencies  for  writes  and  higher  overall  performance.  The 
disadvantage  of  the  model  is  that  all  synchronization  accesses  must  be  identified  by  the  programmer 
or  the  compiler. 

The  release  consistency  model  is  an  extension  to  the  weak  consistency  model  [7].  The  two  models 
are  similar  in  that  memory  needs  to  be  consistent  only  at  the  time  of  synchronization.  Release 
consistency,  however,  distinguishes  between  releasing  (i.e.  unlock)  and  non-releasing  (i.e,  lock) 
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synchronization  operations.  Release  consistency  only  delays  the  performing  of  a  release  operation 
until  previous  operations  are  performed.  Weak  consistency  requires  delaying  of  all  synchronization 
operations. 

DASH  provides  support  for  both  strong  and  release  consistency  models.  This  is  achieved  by 
having  a  highly  parallel  read  and  vvnte  protocol  that  allows  the  memory  to  be  inconsistent  However, 
the  protocol  has  minimum  constraints  that,  when  coupled  with /ence  instructions,  can  enforce  different 
consistencies.  A  fence  operation  [4]  stalls  the  execution  of  subsequent  instructions  until  previous  read 
and  write  operations  are  globally  performed. 

DASH  supports  the  release  consistency  model  in  hardware.  Read  operations  block  until  the  load 
data  is  returned  from  the  cache  or  the  lower  levels  of  the  memory  hierarchy.  Write  operations,  however, 
are  non>blockmg.  There  is  a  write  buffer  between  the  first  and  second  level  caches.  The  write  buffer 
queues  up  the  write  requests  and  issues  them  in  order.  Furthermore,  the  servicing  of  the  write  requests 
are  overlapped.  As  soon  as  the  cache  receives  the  ownership  and  data  for  the  requested  cache  line, 
the  write  data  is  removed  from  the  write  buffer  and  written  into  the  cache  line.  The  next  write  request 
can  be  serviced  while  the  invalidation  acknowledgments  for  the  previous  write  operations  filter  in. 
Thus  parallelism  exists  at  two  levels:  the  processor  may  execute  other  instructions  while  a  preceding 
write  operation  takes  place,  and  invalidations  of  multiple  write  operations  may  overlap. 

DASH  provides  fence  operations  to  allow  emulation  of  other  consistency  models  in  software. 
There  are  two  kinds  of  fence  instructions.  A  full  fence  instruction  that  stalls  a  processor  until  all 
previous  operations  are  globally  performed.  To  implement  strong  consistency,  it  is  sufficient  for  the 
processor  to  issue  a  full  fence  instruction  after  every  write  to  a  possibly  shared  location.  DASH  also 
has  a  write  fence  operation.  The  write  fence  instruction  does  not  stall  the  processor,  it  queues  up  in 
the  write  buffer  and  a  processor  can  continue  executing  instructions  including  issuing  write  operations 
into  the  write  buffer.  A  write  fence  does  not  complete,  however,  until  all  previous  writes  from  that 
processor  have  been  performed.  Release  consistency  can  be  implemented  by  queuing  a  write  fence 
operation  with  the  unlock  operation  in  the  write  buffer.  By  the  time  the  unlock  operation  is  observed, 
all  previous  operations  have  been  performed.  The  fence  operations  are  implemented  using  counters 
that  count  the  number  of  outstanding  memory  requests.  The  fence  operation  succeeds  when  there  are 
no  outstanding  requests. 

3  J  Memory  Access  Optimizations 

The  use  of  release  consistency  hides  some  of  the  memory  latency  for  remote  write  operations.  How¬ 
ever,  remote  read  operations  still  stall  the  processor  for  the  entire  duration  of  access.  In  addition,  since 
read  operations  are  blocking,  no  pipelining  occurs  on  read  requests  initiated  by  the  same  processor. 
The  weak  constraints  of  release  consistency  allow  us  to  remedy  this  by  using  a  variety  of  prefetch  and 
pipelining  techniques.  DASH  provides  several  optimizations  to  hide  the  latency  of  memory  operations. 

3J.1  Prefetch  Operations 

A  prefetch  operation  is  an  expUcit  non-blocking  request  to  fetch  the  data  before  the  actual  memory 
operation  is  issued.  Hopefully,  by  the  time  the  process  needs  that  data  value,  the  dau  has  already  been 
brought  into  its  cache,  thus  hiding  the  latency  of  the  regular  blocking  fetch.  As  a  simple  example  of 
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its  use,  a  process  wanting  to  access  a  row  of  a  matrix  stored  in  another  processing  node's  memory 
can  do  so  efficiently  by  first  issuing  prefetch  reads  for  all  cache  blocks  corresponding  to  that  row. 

The  prefetch  operation  in  DASH  brings  the  dau  into  the  remote  access  cache.  A  subsequent 
memory  access  will  then  be  serviced  by  this  cache.  The  prefetched  data  is  kept  consistent  like  any 
other  data.  If  another  processor  happens  to  write  into  the  cache  line  before  the  data  is  accessed, 
the  data  will  simply  be  invalidated.  The  prefetch  will  be  rendered  ineffective,  but  the  program  will 
execute  correctly.  This  makes  the  programming  and  compilation  task  easier  as  prefetching  a  location 
that  may  be  touched  by  another  processor  will  not  result  in  incorrect  execution.  Research  on  compiler 
optimizations  to  prefetch  data  is  currently  under  way. 

3  Update  and  Deliver  Operations 

In  the  prefetch  scheme,  by  the  time  the  consumer  process  realizes  that  it  should  prefetch  some  data  it 
may  be  close  to  using  it  Latency  will  not  be  effectively  hidden  in  these  cases.  Likewise,  if  multiple 
consumers  need  the  same  itcoi  ci  data,  the  communication  traffic  can  be  reduced  if  data  is  multicast  to 
all  the  consumers  simultar'^  jusly.  Therefore,  DASH  provides  instructions  that  allow  the  producer  to 
send  the  data  directly  to  the  consumers.  Another  advantage  of  these  mechanisms  is  that  the  producing 
node  knows  exactly  when  the  data  is  ready.  The  disadvantage  of  this  scheme  is  that  it  may  clutter 
the  remote  caches  with  data  that  the  consumers  are  not  yet  ready  to  consume.  The  scheme  also 
requires  that,  the  consumer  addresses  be  supplied.  There  are  two  ways  to  specify  the  addresses.  The 
update-write  operation  sends  the  new  data  directly  to  all  processors  that  have  cached  the  data;  the 
deliver  operation  sends  the  data  to  explicitly  specified  destinations. 

The  update-write  primitive  updates  the  value  of  all  existing  copies  of  a  data  word.  Using  this 
primitive,  a  processor  does  not  need  to  first  acquire  an  exclusive  copy  of  the  cache  line,  which  would 
mean  invalidating  all  other  copies.  Rather,  data  is  directly  written  into  the  home  memory  and  all 
caches  that  are  caching  the  line  containing  the  update  word.  The  caching  state  of  the  data  is  not 
modified.  Update  semantics  are  particularily  useful  for  event  synchronization,  such  as  the  release 
event  from  a  barrier  where  an  update-write  can  be  used  to  release  all  waiting  processors. 

The  deliver  instruction  explicitly  specifies  the  destination  clusten  of  the  transfer.  In  this  scheme, 
the  producer  first  writes  out  the  data  to  the  cache  line  using  regular  write  operations.  At  that  time,  any 
other  cached  copies  are  invalidated.  The  producer  then  issues  a  deliver  instruction,  explicitly  giving 
the  destinations  in  a  bit  vector.  A  shared  copy  of  the  cache  block  is  sent  to  all  the  specified  clusters, 
and  the  directory  is  updated  with  the  fact  that  the  data  is  now  shared  between  the  various  clusters. 

As  an  example  of  a  use  of  the  deliver  instruction,  consider  the  example  of  solving  a  system 
of  linear  equations  using  Gaussian  elimination.  Assume  that  the  load  is  balanced  by  distributing  the 
columns  of  the  matrix  among  the  processors.  The  pivot  column  needs  to  be  accessed  by  all  processors 
at  the  beginning  of  each  iteration  of  the  algorithm,  and  consequently  a  hot  spot  results.  The  deliver 
instruction  can  be  used  to  multicast  the  pivot  coliunn  to  all  the  clusters  working  on  the  problem.  This 
avoids  hot-spotting,  and  the  access  to  the  column  is  significantly  reduced  since  a  processor  needs  only 
to  fetch  it  from  the  remote  access  cache.  No  explicit  synchronization  is  needed  on  completion  of  the 
deliver  operation  since  it  is  simply  a  move  of  the  cache  block.  Any  early  read  of  the  block  will  be 
satisfied  by  the  home  or  dirty  cluster. 
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3.4  Support  for  Synchronization 

The  access  patterns  to  locations  used  for  synchronization  are  often  different  from  those  for  other  shored 
data  [22],  For  example,  whenever  a  highly  contended  lock  is  released,  a  large  number  of  waiting 
nodes  rush  to  grab  the  lock.  In  case  of  barriers,  a  large  number  of  processors  need  to  be  gathered  and 
released  efficiently.  Such  phenomena  often  cause  hot  spots  in  the  memory  system  and  can  degrade 
the  performance  of  large  multiprocessors.  Consequently,  synchronization  variables  usually  warrant 
special  treatment 

DASH  provides  two  extensions  to  the  cache  coherence  protocol  to  support  synchronization  prim¬ 
itives:  queue-based  locks,  and  feich-and-incremeni  and  fetch-and*decrement  operations. 

The  queue-based  locks  provide  for  efficient  implementation  of  locks  and  counting  semaphores. 
Ideally,  locks  should  meet  the  following  criteria:  (i)  minimum  amount  of  traffic  generated  by  spin 
waiting,  (ii)  low  latency  release  of  a  waiting  processor,  (iii)  low  latency  acquisition  of  a  free  lock,  and 
(iv)  keeping  the  first  three  criteria  true  for  high  contention  locks.  Most  architectures  handle  locks  by 
providing  an  atomic  test&set  instruction  and  use  a  cached  test-and-lcst&set  scheme  for  spin  waiting. 
This  reduces  the  amount  of  traffic  generated  during  spin  waiting.  While  such  systems  are  usually 
successful  in  satisfying  the  first  three  criteria  for  low  contention  locks,  they  fail  for  high  contention 
locks.  For  example,  assume  there  are  processors  spinning  on  a  lock  value  in  their  caches.  When 
the  lock  is  released,  all  .V  cache  values  are  invalidated,  and  .V  reads  are  generated  to  the  memory 
system.  Depending  on  the  timing,  it  is  possible  that  all  ,V  processors  come  back  to  do  the  test&set  on 
the  location  once  they  realize  the  lock  is  free,  resulting  in  further  invalidations  and  re-reads  [3],  Such 
a  scenario  produces  much  unnecessary  traffic  and  increases  the  latency  in  acquiring  and  releasing  a 
lock. 

The  queue-based  locks  in  DASH  address  this  problem  by  keeping  a  list  in  directory  memory  of 
the  nodes  spinning  in  their  cache  for  the  lock.  Using  the  directory  memory  for  storing  locks  is  natural 
since  the  directory  is  already  set  up  to  track  nodes  caching  a  data  item.  Furthermore,  the  directory 
is  normally  accessed  in  read-modify-write  cycles  that  match  the  atomic  )q?dates  necessary  for  locks. 
Caching  a  locked  lock  minimizes  network  traffic  while  a  processor  is  spinning.  If  a  lock  is  released 
which  has  waiting  nudes,  one  node  is  choosen  at  random  and  is  granted  the  lock.  The  grant  request 
has  update  semantics  in  that  a  processor  can  acquire  a  granted  lock  by  access  on  the  local  cluster, 
as  opposed  to  invalidating  the  value  in  the  cache  and  having  the  node  re-fetch  the  location  from  the 
home.  This  scheme  lowers  both  the  traffic  and  the  latency  involved  in  releasing  a  node  waiting  on 
a  lock.  Informing  only  one  processor  of  the  release  eliminates  unnecessary  traffic  and  reduces  the 
latency  that  would  be  incurred  if  all  waiting  nodes  were  allowed  to  contend. 

The  fetch-and-increment  and  fetch-and-decrement  primitives  [8]  provide  atomic  increment  and 
decrement  operations  at  the  memory  site.  The  value  returned  by  the  operations  is  the  value  before  the 
increment  or  decrement  These  operations  have  low  latency  and  are  useful  for  implementing  several 
synchronization  primitives  such  as  barriers,  distributed  loops,  and  work  queues.  The  serialization  of 
these  operations  is  small  because  they  are  done  directly  at  the  memory  site.  Serialization  is  based 
on  how  fast  the  memory  can  do  the  next  atomic  operation  on  the  location  once  one  is  initiated.  The 
low  serialization  provided  by  the  fetch-and-increment  operation  is  especially  important  when  a  large 
number  of  processors  want  to  increment  a  location,  as  happens  when  getting  the  next  index  in  a 
distributed  loop.  The  benefits  of  the  proposed  operations  become  apparent  when  contrasted  with  the 
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alternative  of  using  a  lock  to  achieve  an  atomic  increment/decrement  of  a  counter.  The  alternative 
requires  acquiring  the  lock,  reading  the  counter,  incrementing  it  at  the  processor,  writing  it  back,  and 
releasing  the  lock.  This  results  in  more  traffic,  latency,  and  serialization. 

The  preceding  has  focused  on  hardware  support  for  synchronization.  We  are  also  exploring 
software  techniques  that  will  help  reduce  latency,  traffic,  and  hot  spots.  For  example,  the  load  on  a 
single  widely  contended  lock  may  be  distributed  to  multiple  locks  through  software  techniques.  For 
primitives  such  as  barriers,  it  is  possible  to  distribute  the  load  by  gathering  and  releasing  through  a 
tree,  thus  eliminating  heavily  contended  locations.  Using  trees  also  reduces  the  latency  for  the  gather 
and  release  stages.  We  are  currently  studying  such  trade-offs  for  other  synchronization  primitives. 


4  The  DASH  Implementation 

A  prototype  implemenution  of  DASH  is  being  developed  in  concert  with  the  architecture.  While 
we  have  developed  a  detailed  software  fimulator  of  the  system  we  feel  that  a  full  implementation 
is  needed  to  uncover  the  issues  in  the  design  of  a  scalable  cache  coherence  machine,  to  verify  the 
feasibility  of  such  designs,  and  to  provide  a  platform  for  studying  real  applications  running  on  a  large 
ensemble  of  processors. 

To  manage  the  size  of  the  prototype  design  effort,  the  processing  node  is  based  on  a  commercially 
available  bus-based  multiprocessor.  The  prototype  system  attempts  to  meet  the  primary  goals  of 
scalability  and  high  performance  within  the  constraints  imposed  by  the  base  node.  While  these 
constraints  have  compromised  performance  in  a  few  areas,  the  implementation  should  be  indicative 
of  the  performance  we  would  expect  from  an  ideal  DASH  system.  Since  we  believe  that  many  of 
the  architectural  features  under  consideration  can  only  be  fully  evaluated  on  actual  hardware,  we  have 
included  many  of  them  in  the  prototype  system.  To  aid  in  the  evaluation  of  the  system  we  are  also 
providing  dedicated  hardware  monitoring  logic. 

4.1  The  DASH  Processing  Node 

The  prototype  system  uses  the  Silicon  Graphics  POWER  Station  4D/240  as  the  base  processing  node. 
The  4D/240  system  consists  of  four  MIPS  RSOOO  processors  and  R3010  floating-point  coprocessors 
running  at  25  MHz.  Each  R3000/R3010  combmation  is  nominally  rated  at  20  VAX  MIPS  and  4 
LINPACK  MFLOPS.  Each  CPU  contains  a  64  Kbyte  instruction  cache  and  a  64  Kbyte  write-through 
data  cache.  The  64  Kbyte  data  cache  interfaces  to  a  256  Kbyte  second-level  write-back  cache.  The 
interface  consists  of  a  read  buffer  and  a  4  word  deep  write-buffer.  Both  the  first  and  second  level 
caches  are  direct-mapped  and  support  16  byte  lines.  The  first  level  caches  run  synchronous  to  the  25 
MHz  processors  while  the  second  level  cache  runs  synchronous  to  the  16.67  MHz  memory  bus. 

The  second-level  processor  caches  are  responsible  for  bus  snooping  and  maintaining  consistency 
among  all  caches  in  the  cluster.  Consistency  is  maintained  using  the  Illinois  coherence  protocol  [17]. 
The  mam  advantage  of  using  the  Blinois  protocol  in  DA.SH  is  the  cache-to-cache  transfers  specified 
in  this  protocol.  While  this  does  not  improve  performance  for  misses  that  would  be  serviced  by  local 
memory,  it  can  greatly  reduce  the  penalty  for  remote  memory  misses.  The  set  of  processor  caches 
together  with  the  remote  access  cache  act  as  a  cluster  cache  of  remote  memory.  The  memory  bus 
(MPBUS)  of  the  4D/240  is  a  synchronous  bus  and  consists  of  separate  32-bit  address  and  64-bit  data 
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buses.  The  MPBUS  is  pipelined  and  supports  memoiy-to-cache  and  cache -to-cache  transfers  of  16 
byies  every  4  bus  clocks  with  a  latency  of  6  bus  clocks.  Thus,  the  maximum  bandwidth  of  the  bus  is 
67  Mbytes/sec.  While  the  MPBUS  is  pipelined,  it  is  not  a  split  transaction  bus. 

To  use  the  4D/240  in  DASH,  we  have  had  to  make  minor  modihcations  to  the  existing  system 
boards  and  design  a  totally  new  directory  controllo'  board.  The  main  modification  to  the  existing  boards 
is  to  add  a  bus  retry  signal  that  is  used  when  a  request  requires  service  from  a  remote  cluster.  Retry, 
together  with  modifications  to  the  bus  arbitration  logic  to  accept  an  arbitration  mask  from  the  directory 
board,  allow  the  system  to  service  remote  requests  without  occupying  the  local  bus  for  the  entire 
duration  of  the  request  Effectively,  remote  requests  are  handled  using  a  split-transaction  bus  protocol. 
The  new  directory  controller  board  consists  of  the  directory  memory,  the  intercluster  coherence  state 
machines,  and  the  logic  needed  to  support  the  interccmnection  network.  The  interconnection  network 
consists  of  a  pair  of  wormhole  routed  meshes,  each  with  16-bit  wide  cbaimels.  One  mesh  is  dedicated 
to  the  request  messages  while  the  other  handles  replies. 

A  block  diagram  of  four  clusters  interconnected  to  form  a  2  x  2  DASH  systein  is  shown  in 
Figure  4.  Such  a  system  could  scale  to  support  hundreds  or  thousands  of  processors,  but  the  prototype 
will  be  limited  to  a  maximum  4x4  configuration.  This  limit  was  dictated  primarily  by  the  physical 
memory  addressability  of  the  4D/240  system,  but  still  provides  for  systems  of  up  to  64  processors 
that  are  capable  of  over  1200  MIPS  and  250  MFLOPS, 
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Figure  5:  Directory  board  block  diagram. 

4.2  The  DASH  Directory  Board 

The  directory  board  is  respwnsible  for  implementing  the  directory-basesd  coherence  protocol  and  in¬ 
terconnecting  the  clusters  within  the  system.  A  block  diagram  of  the  directory  board  is  shown  in 
Figure  5. 

The  directory  board  consists  of  five  major  subsystems,  The  directory  controller  (DC)  includes 
the  directory  memory  .^ssociated  with  the  cachable  main  memory  contained  within  a  single  cluster.  It 
is  also  responsible  for  initiating  out-bound  network  requests  and  replies.  The  pseudo-CPU  (PCPU) 
is  responsible  for  bu^eiing  incoming  requests  and  issuing  such  requests  onto  the  cluster  bus.  It 
mimics  a  CPU  on  this  bus  on  behalf  of  remote  processors  except  that  responses  from  the  bus  are 
sent  out  by  the  directory  controller.  The  reply  controller  (RC)  tracks  outstanding  requests  made  by 
the  local  processors  and  receives  and  buffers  replies  from  remote  clusters  using  the  remote  access 
cache  (RAC),  The  network  interface  and  the  local  portion  of  the  network  itself  reside  on  the  directory 
card.  The  board  contains  hardware  monitoring  logic  and  miscellaneous  control  and  status  registers. 
The  monitoring  logic  samples  a  variety  of  directory  board  and  bus  events  from  which  usage  and 
performance  statistics  can  be  derived. 

Directory  memory  is  accessed  on  each  cluster  bus  transaction.  The  directory  information  is 
combined  with  the  type  of  bus  operation,  the  address,  and  the  result  of  snooping  otlier  caches  to 
deteruine  what  network  messages  and  bus  controls  the  DC  will  generate.  The  directory  memory  itself 
is  implemented  as  a  bit  vector  with  1  bit  for  each  of  the  16  clusters.  While  a  full  bit  vector  has  limited 
scalability,  it  was  chosen  because  it  requires  roughly  the  same  amount  of  memory  as  a  limited  pointer 
directory  given  the  sixe  of  the  prototype,  and  it  allows  for  more  direct  measurements  of  the  caching 
behavior  of  the  machine.  Each  directory  entry  also  contains  a  single  state  bit  that  indicates  whether 
the  clusters  have  a  read  (shared)  or  read/write  (dirty)  copy  of  the  data.  The  directory  is  implemented 
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using  DRAM  tectuiology,  but  performs  all  necessary  actions  within  a  single  bus  transaction. 

43  Interconnection  Network  ' 

As  stated  in  the  architecture  section,  the  DASH  coherence  protocol  does  not  rely  on  a  particular 
interconnection  network  topology.  However,  for  the  architecture  to  be  scalable,  the  network  itself 
must  provide  scalable  bandwidth.  It  should  also  provide  low  latency  communication.  The  prototype 
system  uses  a  pair  of  wormhole  routed  mesh  networks  to  implement  the  interconnection  network.  The 
network  chips  are  extended  versions  of  the  Caltech  mesh  routing  chips  [9]  and  have  16-bit  wide  data 
paths.  Since  wormhole  routing  allows  nodes  to  forward  a  message  after  receiving  only  the  first  flit 
(flow  unit)  of  the  packet,  the  latencies  in  the  network  are  small.  We  expect  the  latency  through  a  node 
in  the  network  to  be  approximately  40ns.  The  networks  are  asynchronous  and  self-timed.  The  speed 
of  the  network  is  limited  by  the  round  trip  delay  of  the  request-acknowledge  signals.  In  DASH,  we 
expect  the  flits  to  be  transferred  at  30  MHz,  resulting  in  a  total  network  bandwidth  of  240  Mbytes/sec 
to  each  node. 

An  important  constraint  on  the  network  is  that  it  should  be  able  to  deliver  request  and  reply 
messages  generated  by  user  programs  and  the  coherence  protocol  without  deadlocking.  Most  networks 
are  guaranteed  to  be  deadlock  free  if  messages  are  consumed  at  the  receiving  node.  In  general,  this 
assumption  is  not  true  for  DASH.  The  processing  nodes  require  the  ability  to  generate  an  out-going 
message  at  the  same  time  that  they  consume  a  in-coming  message.  For  example,  to  consume  a 
read  request  the  home  cluster  must  generate  a  reply  message  containing  the  data.  Similarly,  to 
consume  a  request  for  a  location  that  is  dirty  in  a  remote  cluster,  the  home  cluster  needs  to  generate 
a  forwarding  request  to  the  remote  cluster.  This  requirement  adds  the  potential  for  deadlocks  that 
consist  of  a  sequence  of  messages  that  have  circular  dependencies  through  a  node.  These  deadlocks 
are  broken  in  two  ways.  The  first  relics  on  the  fact  that  reply  messages  can  always  be  consumed 
without  dependency  on  out-going  messages.  Thus,  having  two  independent  mesh  networks  dedicated 
to  request  and  replies  guarantees  that  there  will  never  be  a  request-reply  dependency  deadlock.  Second, 
a  timeout  mechanism  in  the  request  network  allows  potential  deadlocks  in  the  request  network  (due  to 
request-request  dependencies)  to  be  broken  by  rejecting  requests  through  negative  acknowledge  reply 
messages. 

4,4  Implementing  Cache  Coherence 

A  complete  description  of  the  implementation  of  the  DASH  coherence  protocol  is  beyond  the  scope  of 
this  paper.  The  following  examples,  however,  should  give  some  insight  into  how  the  various  sections 
of  hardware  interact  to  carry  out  the  protocol.  For  a  more  detailed  discussion  see  [16]. 

Figure  6  shows  a  simple  read  of  a  memory  location  whose  home  is  a  remote  cluster  and  whose 
state  in  the  directory  is  dirty  in  another  cluster.  The  request  is  not  satisfied  within  the  local  cluster 
and  is  sent  to  the  home(l).  The  home  cluster  forwards  the  request  to  the  remote  cluster(2)  where  it 
is  in  dirty  state.  The  dirty  cluster  then  responds  directly  to  the  local  cluster(3a)  while  simultaneously 
sending  a  sharing  writeback  to  the  home(3b)  to  update  the  directory  and  main  memory. 

Figure  7  shows  the  sequence  of  operations  generated  by  a  read-exclusive  command  to  a  remote 
address  when  the  directory  is  in  the  shared  state.  The  read-exclusive  conunand  originates  with  a 
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Figure  6:  Flow  of  a  read  request  to  remote  memory. 


processor  write  to  a  cache  line  that  is  either  uncacbed  by  the  processor  or  is  held  in  the  shared  state. 
The  write  is  buffered  in  the  write-buffer,  but  the  write-buffer  must  obtain  exclusive  access  to  the 
line  before  the  write  can  complete.  Assuming  the  line  is  in  the  shared  state  the  request  must  first  be 
sent  to  the  home(l).  The  home  directory  replies  to  the  request  immediaiely(2a)  while  also  sending 
invalidation  requests  to  the  clusters(2b)  currentlj  caching  that  lire.  Once  the  local  cluster  receives 
the  data  block  from  the  directory,  the  release  consistency  model  allows  the  processor  to  be  granted 
exclusive  ownership  of  the  line,  and  for  the  write-buffer  to  retire  the  write.  The  RAC  entry  associated 
with  the  request  persists  until  all  the  associated  invalidation  acknowledges  (3)  have  been  received. 

An  important  feature  of  the  coherence  protocol  is  its  forwarding  strategy.  If  a  cluster  cannot 
directly  reply  to  a  given  request,  it  forwards  responsibility  for  the  request  to  a  cluster  that  should  oe 
able  to  respond.  This  minimize  the  latency  for  a  request,  as  the  request  is  always  forwarded  to  the 
location  where  the  data  is  thought  to  be,  which  can  then  directly  reply  to  the  request.  This  technique 
also  minimizes  the  serialization  of  requests  since  no  cluster  resources  are  blocked  while  inter-cluster 
messages  are  being  sent  Forwarding  allows  the  directory  controller  to  work  on  multiple  requests 
concurrently  (i.e.  make  it  multi-threaded)  without  adding  the  complexity  of  such  multi-threading. 

Performance  of  the  various  operations  is  closely  related  to  the  speed  of  the  MPBUS  and  the 
latency  of  inter -cluster  communication.  For  DASH  we  have  attempted  to  minimize  the  number  of  bus 
and  network  transactions  needed  before  the  processor  can  proceed.  Figure  8  shows  the  latencies  for 
various  memory  operations  assuming  the  system  is  unloaded. 
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Figure  7:  Flow  of  a  read-exclusive  request  to  remote  memory. 
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Figure  8:  Latency  for  various  memory  system  operations  in  number  of  processor  clocks.  Each 
processor  clock  for  the  DA.SH  protoype  is  40ns. 
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5  Alternative  Approaches 


In  this  section,  we  briefly  compare  DASH  to  other  proposed  scalable  architectures  that  support  cache- 
coherent  shared-memory. 

5.1  Encore  GigaMax  and  Stanford  VMP-MC 

The  Encore  GigaMax  architecture  [23]  and  the  Stanford  VMP-MC  project  [5]  both  use  a  hierarchy- 
of-buses  approach  in  building  scalable  shared-memory  architectures.  At  the  top  level,  the  Encore 
GigaMax  is  composed  of  several  clusters  on  a  global  bus.  Each  cluster  consists  of  several  processor 
modules,  a  cluster  cache  that  caches  remote  accesses,  and  main  memory  cotmected  by  a  cluster 
bus.  Each  processor  module  consists  of  several  processors  with  private  caches  and  a  large  shared 
second-level  cache. 

The  VMP-MC  machine  is  similar  to  the  GigaMax  in  terms  of  the  hierarchy  of  processors,  caches, 
and  buses.  It  is  different,  however,  in  that  the  physical  memory  is  all  located  at  the  global  level, 
and  it  uses  a  hierarchical  directory-based  coherence  protocol.  The  clusters  containing  cached  data 
are  identified  by  a  bit- vector  scheme  at  every  level  mstead  of  using  snooping  cluster  caches.  The 
VMP-MC  also  provides  a  lock  bit  per  memory  block  that  enhances  performance  for  synchronization 
and  explicit  communication. 

The  global  bus  is  the  critical  link  in  hierarchical  bus-based  machines.  Unless  an  application’s 
communication  requirements  match  the  bus  hierarchy,  or  it  has  a  very  small  sharing  traffic  require¬ 
ment,  the  global  bus  will  be  a  bottleneck.  Both  requirements  are  restrictive  and  limit  the  classes  of 
applications  that  can  be  rtm  on  these  machines.  We  feel  the  DASH  architecture  is  more  robust  be¬ 
cause:  (i)  it  provides  a  scalable  interconnect  between  the  multiple  clusters  rather  than  a  single  global 
bus,  and  (ii)  it  makes  effective  use  of  this  scalable  point  to  point  interconnection  network  through  the 
directory-based  cache  coherence  protocol. 

S2  Wisconsin  Muiticube 

The  Multicube  architecture  [13]  employs  a  snooping  cache  system  over  a  grid  of  buses.  Each  processor 
has  a  first-level  cache  that  minimizes  latency  and  a  second-level  cache,  referred  to  as  the  snooping 
cache,  that  is  designed  to  minimize  bus  traffic.  Each  snooping  cache  monitors  two  buses,  a  row  bus 
and  a  column  bus,  in  order  to  maintain  data  consistency  among  the  snooping  caches.  In  addition,  there 
are  modified  line  tables  for  snooping  purposes.  The  memory  of  the  machine  is  not  distributed  among 
the  processors  because  the  architects  did  not  want  to  impose  the  notion  of  geographical  locality  onto 
the  software.  In  support  for  synchronization.  Multicube  provides  an  efficient  queuing  mechanism  for 
handling  locks  [12]  that  is  similar  to  the  approach  taken  in  DASH. 

We  feel  that  a  major  drawback  of  Multicube’s  grid-of-buses  design  is  that  with  powerful  processors 
that  need  substantial  silicon/board  area,  the  buses  will  have  to  be  long,  and  consequently  will  not  run 
at  very'  high  speeds.  In  addition,  since-  the  second-level  caches  are  large  they  will  most  likely  be 
built  using  DRAM,  and  snooping  on  these  DRAM  caches  will  be  slow,  further  limiting  bus  speed  and 
increasing  latency  for  accesses. 
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5J  lEEE-SCI  Protocol 


The  IEEE  PI 596,  Scalable  Coherent  Interface  [14],  is  similar  to  the  DASH  project  in  that  it  attempts  to 
provide  for  scalable  computer  systems  that  rely  on  cache  coherent  processors.  It  differ.^  from  DASH  in 
that  it  is  a  bus  standard,  not  a  complete  system  design.  Thus,  it  only  specifies  the  interfaces  that  each 
processor  should  implement,  leaving  open  the  actual  processor  design  and  interconnection  between 
SCI  nodes.  This  difference,  together  with  the  evolving  nature  of  SCI  make  a  detailed  comparison 
impossible,  but  there  are  certain  general  areas  where  the  two  can  be  compared. 

The  SCI  architecture,  at  a  high  level,  is  similar  to  DASH  with  each  node  having  a  processor,  a 
portion  of  main  memory  and  t  interface  to  the  interconnection  network.  The  major  difference  is  in 
how  and  where  the  directory  ruormation  is  maintained.  In  SCI,  the  directory  is  a  distributed  sharing 
list  maintained  by  the  caches.  For  example,  if  processors  A,  B,  and  C  are  caching  some  location, 
then  the  lines  storing  this  location  in  these  processor’s  cache  form  a  doubly-linked  list.  Main  memory 
holds  only  a  pointer  to  the  processor  at  the  head  of  this  linked  list.  In  contrast,  DASH  places  all 
directory  memory  with  main  memory. 

The  main  advantage  of  the  SCI  scheme  over  DASH  is  that  the  amount  of  directory  pointer  storage 
grows  naturally  with  the  number  of  processors  actually  used  in  a  system.  In  DASH,  the  maximum 
number  of  processors  must  be  fixed  beforehand,  or  the  system  must  support  some  form  of  limited 
directory  information.  The  primary  disadvantages  of  the  SCI  scheme  are  due  to  the  added  complexity 
and  latency  required  to  manipulate  the  distributed  directory  entries.  In  addition  to  the  multitude  of 
interactions  present  in  DASH,  many  more  interactions  are  present  in  SCI  as  it  lacks  atomic  updates  to 
the  directory  entries,  More  importantly,  the  distributed  nature  of  the  SCI  directory  entries  inherently 
increases  the  latency  of  operations.  Again,  while  a  complete  evaluation  is  not  possible  at  this  time,  we 
feel  that  minimizing  latency  is  critical,  and  that  the  problems  of  maintaining  the  pointers  at  memory 
can  be  solved  through  schemes  using  a  limited  number  of  directory  pointers. 


6  Summary  and  Status 

In  this  paper  we  have  described  the  design  and  implementation  decisions  for  DASH,  a  scalable  shared- 
memory  multiprocessor  with  coherent  caches.  The  key  concepts  in  achieving  scalability  are  the  use 
of  a  distributed  directory-based  cache  coherence  protocol,  distributed  memories  and  directories,  and  a 
scalable  interccnnection  network.  The  design  optimizes  along  the  dimensions  of  scaling  the  memory 
bandwidth  and  reducing  the  latency  of  remote  memory  accesses.  We  have  also  discussed  several  other 
performance  enhancements  for  DASH.  These  include  mechanisms  for  reducing  and  tolerating  latency 
such  as  prefetch,  the  release  consistency  model  and  synchronization  primitives  such  as  queue-based 
locks. 

We  are  actively  proceeding  with  building  a  prototype  of  DASH.  At  the  time  of  writmg,  we  have 
completed  the  design  of  the  coherence  protocol  and  the  associated  state  machines,  and  we  have 
identified  the  modifications  necessary  to  the  existing  Silicon  Graphics  hsrdw'arc.  Wc  arc  currently 
at  the  schematic  entry  phase  for  the  directory  board.  The  interconnection  network  has  already  been 
designed  and  we  have  recently  received  the  network  chips  back  from  fabrication.  Our  goal  is  to  have 
a  prototype  with  16  processors  up  and  nmning  by  the  end  of  summer  of  1990. 

Iq  additioD  to  the  task  of  building  the  prototype,  wc  arc  putting  substantial  effort  in  developing 
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trace  generation  and  architecture  simulation  tools.  Tango  [11],  a  powerful  multiprocessor  emulator 
that  generates  address  traces,  is  coupled  with  simulators  for  DASH  and  other  architectures  to  provide 
detailed  performance  data.  We  are  also  using  the  DASH  simulator  for  validating  and  verifying 
(he  directory-based  coherence  protocol.  On  the  software  side,  we  are  working  on  a  parallelizing 
compiler,  developing  software  macro  libraries,  and  providing  an  operating  system  for  DASH.  The 
DASH  parallelizing  compiler  expands  on  the  work  previously  done  at  Stanford  on  partitioning  and 
scheduling  [19],  The  software  macro  libraries  will  incorporate  the  special  features  of  DASH  in 
providing  synchronization  and  scheduling  functions.  The  DASH  operating  system  will  be  either  an 
SGI’s  UNIX  System  V.3  or  a  version  of  MACH  tailored  to  our  machine. 

To  help  evaluate  and  focus  our  parallel  software  and  hardware  efforts,  we  are  also  working  on 
several  parallel  applications.  The  applications  include  distributed-time  digital  logic  simulation,  solution 
of  large  sparse  systems  of  equations,  standard  cell  routing,  ocean  simulation,  and  particle  simulation 
for  high-speed  flight.  To  support  DASH  application  development  and  optimization,  we  are  developing 
software  performance  monitoring  and  analysis  packages  that  couple  with  our  hardware  monitoring 
subsystem  to  provide  the  programmer  with  effective  performance  information  about  an  application. 


7  The  DASH  Team 

There  are  a  number  of  graduate  students  and  faculty  working  on  the  DASH  project  The  Ph.D.  students 
art:  Daniel  Lenoski,  Kourosh  Gharachorloo,  and  James  L.audon  (DASH  architecture  and  hardware 
design);  Wolf-Dietrich  Weber  (invalidation  patterns,  DASH  simulator,  scalable  directories);  Helen 
Davis  and  Stephen  Goldschmidt  (trace  generation  tools,  synchronization  patterns,  locality  studies); 
Margaret  Martonosi  and  Tom  Cbanak  (mesh  routing  chip  simulation  and  design);  Richard  Simoni 
(synthetic  load  generator  and  directory  studies);  Truman  Joe  (SGI  board  modifications  and  simulator 
for  Encore  Gigamax);  Josep  Torrellas  (sharing  patterns  in  applications);  Eric  William.s  and  Raul  Izahi 
Lopez  (SGI  board  modifications);  Todd  Mowiy  (evaluation  of  prefetch  operations);  Edward  Rothberg, 
Jaswiader  Pal  Singh,  Larry  Soule  (applications  and  algorithm  development);.  Bruce  Kleinman,  a 
master's  student,  is  working  on  verification  of  the  DASH  coherence  protocol.  Research  engineer 
David  Nakahira  has  also  contributed  to  the  hardware  design.  The  faculty  associated  with  the  project 
are  Anoop  Gupta,  John  Hennessy,  Mark  Horowitz,  and  Monica  Lam. 
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Abstract 

Capable  of  executing  multiple  scalar  operations  per  cycle,  a  superscalar  architecture  can  parallelize  not 
just  vectorizable  programs,  but  also  code  containing  recurrences  and  data  dependent  control  flow.  This 
paper  presents  an  overview  of  the  compiler  optimizations  that  are  crucial  in  harnessing  the  computation 
power  of  superscalar  machines.  These  optimizations  include  high-level  loop  transformations  to  fmd 
parallelism  and  improve  the  efficiency  of  caches,  software  pipelining  and  hierarchical  reduction 
techniques  for  scheduling  instructions,  and  modulo  variable  expansion  for  assigning  registers. 

Recent  advances  in  hardware  technology  have  made  superscalar  architectures  amenable  to  single-chip 
implementations.  The  combination  of  cheap  hardware  to  provide  a  high  raw  computing  power  and 
sophisticated  compiler  technology  to  effectively  use  the  parallelism  can  produce  extremely  low-cost, 
high-performance  woitstations  that  are  easily  accessible  to  the  general  scientific  and  engineering 
community. 

1.  Introduction 

A  superscalar  computer  is  a  uniprocessor  that  can  execute  two  or  more  scalar  operations  in  parallel. 
The  operations  are  individually  specified  in  the  object  code;  this  is  distinct  from  vector  machines  which 
expand  vector  instructions  into  series  of  parallel  operatioos.  The  parallelism  of  a  vector  instruction  is 
defined  for  each  vector  machine  at  machine  design  time;  on  a  superscalar  machine,  a  parallel  execution 
schedule  is  created  uniquely  for  each  program,  by  either  hardware  or  software.  As  a  result,  superscalar 
machine  organizations  are  more  versatile  and  effective  in  using  the  hardware  resources  in  the  system. 

Superscalar  machines  existed  long  before  the  term  was  coined.  The  IBM  Stretch  [5],  the  CDC 
6600  [24]  and  the  IBM  360/91  [2]  are  all  superscalar  architectures  that  can  execute  multiple  operations  in 
parallel.  These  machines  all  implement  a  sequential  instruction  set  with  hardware  that  schedules  the 
instructions  dynamically.  Besides  hardware,  software  has  also  been  used  for  instruction  scheduling. 
Epitomizing  the  class  of  superscalar  machines  that  rely  on  software  for  scheduling  instructions  is  the 
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VLIW  (Very  Long  Instruction  Word)  architecture  [13).  Each  wide  instruction  word  explicitly  speciiles 
the  operations  to  be  executed  in  parallel.  Examples  of  such  machines  include  the  Multiflow’s  Trace 
machines  [8],  the  Carnegie  Mellon’s  processors  for  the  Warp  systolic  array  [3]  and  the  Cydrome’s  Cydra 
5  [9].  The  recent  hardware  technology  advances  have  made  software  scheduled  superscalar  architectures 
amenable  to  single-chip  implementations.  A  follow-on  of  the  Warp  processor,  the  Carnegie  Mellon  and 
Intel’s  iWaip  processor  integrates  high-performance  computation  and  systolic  communication  in  a  single 
component  [6].  The  Intel’s  i860  is  a  single-chip  microprocessor  that  can  perform  up  to  100  million 
floating-point  per  seconds  (MFLOPS)  using  a  dual-instruction  word  format  [17]. 

The  development  of  the  recent  superscalar  architectures  presents  an  exciting  prospect  to  the  engineering 
and  scientific  community.  As  technology  improves,  the  superscalar  processor  performance  is  expected  to 
grow.  The  superscalar  processor  provides  a  more  flexible  form  of  instruction  parallelism  in  a  low-cost 
package.  The  impact  is  that  high  computing  power  can  be  easily  provided  in  a  low-cost  desktop 
workstation  that  is  widely  accessible  to  engineers  and  scientists.  The  high-level  of  integration  also  makes 
these  scalar  processors  a  useful  building  block  for  large-scale  multiprocessing,  thus  delivering  an 
aggregate  computation  bandwidth  higher  than  ever  before. 

The  parallelism  of  a  superscalar  machine  may  be  managed  in  hardware  or  software.  The  hardware 
approach  schedules  the  instructions  dynamically,  thus  hiding  parallelism  from  the  architecture.  The 
instruction  set  architecture  can  therefore  be  made  compatible  with  that  of  an  existing  sequential  machine. 
Run-time  scheduling,  however,  requires  more  hardware  logic,  which  may  result  in  a  slower  clock  cycle  or 
longer  latency  in  instruction  execution.  In  the  software  approach,  the  parallelism  is  exposed  at  the 
architecture  level,  and  the  compiler  is  responsible  for  specifying  the  parallel  operations  to  execute.  By 
analyzing  the  entire  program  statically,  the  compiler  can  exploit  higher  level  program  semantics  and 
rearrange  the  code  globally  to  derive  a  better  schedule. 

To  harness  the  raw  computing  speeds  of  software-scheduled  superscalar  processor  in  applications, 
compiler  technology  is  crucial.  The  compiler  hides  the  parallelism  from  the  programmer,  so  the 
programmer  can  develop  applications  easily  using  a  high-level  sequential  language.  This  approach  has 
the  additional  advantage  that  the  same  sequential  programs  can  now  easily  be  ported  to  other  cturent  and 
future  machine  architectures. 

In  this  paper,  we  first  describe  the  characteristics  of  the  superscalar  architecture  and  the  issues  in 
compiling  code  for  such  machines.  We  then  present  a  set  of  compiler  optimizations,  showing  how  the 
fimctionali^  of  the  processor  can  be  used  in  programs.  We  then  close  with  a  discussion  on  the 
performance  of  these  superscalar  machines. 

2.  Superscalar  Architectures 

Common  to  all  superscalar  processors  is  the  presence  of  parallel  and/or  pipelined  functional  units.  Like 
any  machine  that  employs  parallelism  and  pipelining,  a  program  running  on  a  superscalai  seldom 
achieves  the  peak  compulation  rate  of  the  machine.  If  a  superscalar  processor  has  n  functional  units,  or  a 
functional  unit  with  n  pipeline  stages,  n  independent  operations  must  be  present  at  all  times  to  utilize  the 


3 


machine  fully.  If  no  parallelism  is  found,  the  machine  may  operate  at  1/nth  of  the  peak  rate.  Therefore, 
for  a  superscalar  to  be  effective,  it  is  important  that  the  scheduler  can  find  enough  independent  operations 
to  execute  in  parallel. 

Before  we  discuss  the  scheduling  techniques,  let  us  first  take  a  look  at  the  fimdamental  limit  the 
hardware  imposes  on  the  execution  speed  of  a  program.  Even  if  there  are  enough  independent  operations, 
the  full  computation  power  of  a  superscalar  may  not  be  brought  to  bear  on  an  application  because  of 
specialization.  The  processor  typically  consists  of  a  set  of  specialized  functional  units,  some  memory 
access  units,  possibly  different  arithmetic  units,  and  an  instruction  branch  control  unit.  For  example,  a 
program  that  requires  no  multiplications  will  not  be  able  to  take  advantage  of  the  multiplication  unit  on 
the  processor. 

The  hardware  of  a  system  is  typically  designed  such  that  the  distribution  of  the  computational  units 
matches  the  disnibution  of  operations  in  a  typical  program.  From  the  statistics  of  a  large  set  of  numerical 
applications  [18],  we  have  observed  that  there  are  about  as  many  floating-point  arithmetic  operations  as 
memory  operations.  About  60%  of  the  memory  operations  are  read  operations,  and  about  70%  of  the 
floating-point  operations  are  additions.  On  a  machine  that  can  execute  one  memory  read,  one  memory 
write,  one  floating-point  addition,  and  one  floating-point  multiplication  in  a  single  cycle,  the  adder  is 
often  the  critical  resource  and  is  followed  by  the  memory  read  unit 

Besides  the  utilization  of  the  functional  units  in  a  processor,  it  is  also  important  to  consider  the  memory 
subsystem.  To  support  a  high  computation  bandwidth,  a  processor  must  also  have  a  similarly  powerful 
memory  subsystem.  For  a  vector  machine,  the  more  restricted  mode  of  operation  permits  the  use  of 
vector  registers  and  efficient  block  transfers  between  the  memory  subsystem  and  the  registers.  Being  able 
to  support  a  less  regular  form  of  parallelism,  a  superscalar  architecture  requires  a  more  flexible  memory 
system.  The  concept  of  memory  hierarchy  has  been  shown  to  be  useful  in  reducing  the  average  access 
latencies  for  general-purpose  machines.  A  cache  can  also  reduce  the  number  of  memory  accesses  which 
can  be  important  in  a  multiprocessor  environment 

Unfortunately,  a  cache  sometimes  behaves  rather  poorly  for  numerical  code.  Because  of  the  large  dau 
set  used,  data  brought  into  the  cache  may  be  flushed  out  before  they  are  reused.  The  cache  bit  rate  can 
fluctuate  widely  depending  on,  for  example,  whether  a  matrix  operand  is  in  the  cache.  This  may  greatly 
affect  the  overall  speed  obtained  due  to  the  large  difference  between  cache  and  memory  speeds.  While  a 
cache  is  normally  transparent  to  compilers  for  general-purpose  programs,  it  is  beneficial  to  optimize  the 
cache  behavior  in  superscalar  compilers. 

In  many  ways,  a  superscalar  compiler  faces  similar  issues  as  those  of  a  vectorizing  compiler.  The 
compiler  must  extract  parallelism  from  sequential  programs  and  try  to  use  the  parallel,  specialized 
functional  units  effectively.  The  compiler  must  also  manage  the  cache;  this  is  analogous  to  the 
management  of  vector  registers  in  vectorizing  compilers.  Though  the  issues  are  similar,  a  superscalar 
machine  presents  new  challenges  to  compiler  optimization.  The  parallelism  must  be  managed  at  the 
scalar  operation  level  and  the  paialleiism  exploitable  is  not  regular  like  vector  instructioas. 
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3.  Overview  of  Compiler  Techniques 

There  are  two  levels  of  compiler  optimization:  the  loop  level  and  the  instruction  level.  The  loop  level 
involves  higher  level  transformations  on  the  loop  structure.  These  transformations  are  useful  both  for 
bringing  parallelism  to  the  innermost  loop  as  well  as  improving  data  locality.  This  high-level 
restructuring  prepares  the  loop  for  low-level  instruction  scheduling. 

The  instruction  level  optimization  consists  of  instruction  scheduling  and  register  assignment 
techniques.  The  scheduling  problem  is  to  fmd  the  shortest  instruction  schedule  that  satisfies  the 
constraints  imposed  by  the  machine  resources  and  the  program  semantics.  In  particular,  since  most  of  the 
computation  time  is  spent  on  innermost  loops,  it  is  important  to  schedule  such  loops  efficiently.  Software 
pipelining  is  a  scheduling  technique  that  exploits  the  repetitive  nature  of  innermost  loops  to  generate 
highly  efficient  code  for  processors  with  parallel,  pipelined  functional  units  [19, 22, 25],  Another  code 
scheduling  technique  used  with  software  pipelining  is  hierarchical  reduction,  a  technique  that  abstracts 
control  constructs  as  operations  in  a  basic  block,  so  the  same  scheduling  algorithms  can  be  applied  to 
within  and  across  basic  blocks.  For  example,  using  hierarchical  reduction,  software  pipelining  can  be 
applied  to  all  innermost  loops,  including  those  containing  conditional  statements.  Hierarchical  reduction 
makes  it  possible  to  obtain  a  consistent  performance  improvement  for  many  more  programs.  Interacting 
with  code  scheduling  is  register  assignment.  When  the  same  register  is  assigned  to  different  variables, 
their  uses  must  be  serialized,  thus  constraining  the  parallelism  in  the  computation.  Therefore,  the  register 
assignment  must  also  be  considered  hand-in-hand  with  instruction  scheduling. 

In  the  following,  we  first  present  an  overview  of  the  analysis  techniques  necessary  to  support  both  loop 
level  and  instruction  level  parallelism.  We  then  discuss  each  of  the  optimizations;  loop  level 
transformations,  software  pipelining,  hierarchical  reduction  and  register  assignment. 

Program  semantics  produces  two  kinds  of  constraints;  control  dependences  and  data  dependences.  A 
conditional  branch  instruction  must  first  be  executed  to  determine  the  instruction  to  execute  next  This 
sequencing  constraint  is  known  as  control  dependence.  An  operation  cannot  execute  until  all  its  operands 
are  produced.  This  sequencing  constraint  is  known  as  true  data  dependence.  To  ensure  that  a  read 
operation  always  reads  the  latest  value  produced,  the  order  of  the  write  operations  on  the  same  location 
must  also  be  observed.  This  sequencing  constraint  is  known  as  ouipu;  <ie/7em2ence.  Furthermore,  since  a 
data  location  may  hold  different  values  at  different  times,  a  value  must  not  be  overwritten  before  its  use. 
This  form  of  data  dependence  is  known  as  anti-dependence. 

The  compiler  must  first  extract  dependence  constraints  from  the  program.  The  analysis  algorithms  are 
similar  to  those  previously  used  for  vectorizing  and  concuirentizing  compilers.  The  control  dependence 
can  either  be  obtained  through  analysis  of  the  flow  graph  [11],  or  simply  retained  from  the  syntactic 
control  structure  of  the  program  [16].  For  data  dependence,  since  array  references  are  very  common  in 
numerical  code,  it  is  important  to  determine  if  two  array  references  can  refer  to  the  same  location,  and 
thus  may  share  a  dependence  relationship  between  them.  Various  dependence  tests  have  been  proposed 
for  disainbigiiating  between  array  references  whose  indices  are  an  affine  function  of  loop 
indices  [1,4, 27]. 
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The  dependence  information  was  used  previously  only  for  source-to>source  loop  transformations.  For  a 
superscalar  machine,  this  information  is  used  at  both  the  loop  and  instruction  level.  In  the  compiler 
currently  developed  at  Stanford,  data  dependence  is  capnired  in  an  intermediate  representation  that 
supports  loop  level  transformations,  and  this  same  information  can  be  used  in  the  code  generation  phase. 

4,  Loop  Level  Transformations 

High  level  code  transformations  are  useful  in  bringing  parallelism  into  the  innermost  loop,  as  well  as 

improving  the  efficiency  of  the  caches.  Consider  the  simple  example  of  a  matrix  multiplication: 

FOR  i  :■  0  TO  n-1  DO 
FOR  j  0  TO  n-1  DO 

FOR  k  0  TO  n-1  DO 

C[i,jj  A[i,k] "Blk, j]+Cti, j] ; 

The  result  of  one  addition  is  used  by  the  addition  in  the  next  iteration  of  the  loop.  The  additions  must 

therefore  execute  sequentially;  with  an  n-stage  pipelined  adder,  an  iteration  takes  at  least  n  clocks.  The 

multiplications,  being  independent,  can  execute  in  parallel  with  the  additions.  (Unlike  a  vector  machine, 

a  superscalar  machine  can  execute  some  instructions  in  parallel  even  for  recurrences.)  To  further  increase 

the  utilization  of  the  machine,  the  compiler  must  perform  higher  level  transformations  so  as  to  expose 

more  parallelism  in  the  innermost  loop  to  the  instruction  scheduler.  In  this  example,  if  the  inner  two 

loops  are  interchanged  as  follows: 

FOR  i  :*  0  TO  n-1  DO 
FOR  k  0  TO  n-1  DO 

FOR  j  :«  0  TO  n-1  DO 

C[i, j]  A[i,k]*B[k, j]+C[i, j]; 

The  iterations  in  the  innermost  loop  are  now  independent;  as  many  iterations  as  necessary  can  execute  in 
parallel  to  fully  utilize  the  hardware  resources  of  the  machine.  Therefore,  when  the  innermost  loop  does 
not  contain  enough  parallel  operations  to  keep  the  hardware  resources  busy,  high  level  transformations, 
similar  to  those  used  in  vectorizing  and  parallelizing  compilers,  should  be  applied. 

For  superscalar  machines  with  caches,  high  level  transformations  can  also  be  used  to  improve  overall 

performance  by  reducing  the  cache  miss  rate.  Consider  a  machine  whose  cache  is  relative  small  in 

comparison  with  the  matrix  size.  The  objective  of  the  optimization  is  to  minimize  memory  accesses  by 

reusing  data  in  the  cache  as  much  as  possible.  In  the  optimized  program  above,  the  innermost  loop 

accesses  rows  k  and  i  of  matrices  B  and  C,  respectively.  The  same  row  of  C  is  used  in  the  next  outer 

loop,  but  the  B  data  will  not  be  reused  until  the  next  iteration  in  the  outermost  loop.  If  the  data  size  is 

large  compared  to  the  cache,  even  the  C  data  may  not  be  in  the  cache,  let  alone  the  B  data.  Maximum 

reuse  is  obtained  if  we  can  block,  or  tile,  the  computation  as  follows: 

FOR  ii  :■  0  TO  n-1  BY  b  DO 
FOR  kk  0  TO  n-1  BY  b  DO 

FOR  jj  :»  0  TO  n-1  BY  b  DO 

FOR  i  :■  ii  TO  niin(ii+b-l/  n)  DO 

FOR  k  kk  TO  min(kk+b-l,  n)  DO 

FOR  j  jj  TO  niin(jj+b-l,  n)  DO 

C[i,j]  A[i,k]»B[k,  j]+C[i, j] ; 

Each  of  the  matrix  elements  brought  into  the  cache  is  reused  b  times  before  it  is  removed  from  the  cache. 
The  value  of  b  is  chosen  to  maximize  the  cache  utilization. 
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Previous  research  on  data  locality  has  provided  ways  to  predict  the  cache  behavior  of  a  loop  nest. 
Gannon  et  al.  [14]  use  uniformly  generated  references  to  find  where  locality  exists  in  a  nesting  of  loops. 
They  also  discuss  how  to  choose  which  array  elements  should  go  into  the  cache  for  a  given  loop. 
Porterfield  [21]  estimates  cache  behavior  for  a  loop  nest  assuming  that  the  cache  uses  the  least  recently 
used  replacement  policy,  and  may  block  a  loop  if  the  cache  cannot  hold  all  the  data  in  an  iteration. 
Gannon  et  al.’s  and  Porterfield’s  estimates  can  be  used  to  evaluate  the  data  locality  of  entire  loop  nests 
obtained  by  different  sets  of  transformations. 

Loop  transformations  beneficial  to  data  locality  and  parallelism  for  superscalar  machines  include  loop 
interchange,  reversal,  skewing  and  tiling.  Wolf  and  I  have  developed  an  efficient  algorithm  to  search 
through  the  space  of  these  transformations  and  generates  code  that  displays  data  locality  and  parallelism 
in  the  innermost  loops  [26].  We  reduce  the  optimization  problem  to  placing  the  maximum  number  of 
loops  identified  to  carry  locality  in  the  irmermost  tile.  Using  this  goal  and  the  legality  considerations  of 
tiling,  we  can  significantly  prune  the  search  space  to  find  the  best  set  of  transformations.  How  tiling 
improves  data  locality  has  been  illustrated  by  the  example  above.  The  conditions  that  made  tiling  legal  in 
the  first  place  guarantee  both  coarse  and  fine  grain  parallelism  within  a  tiled  loop.  Therefore,  by  tiling  the 
loops,  we  generate  code  that  exhibits  both  data  locality  and  parallelism. 

5.  Software  Pipelining 

After  performing  the  high-level  transformations,  the  compiler  can  then  apply  the  instruction  level 
optimizations.  The  basic  technique  for  obtaining  parallelism  is  software  pipelining.  Let  us  introduce  the 
concept  of  software  pipelining  by  way  of  an  example.  Suppose  we  have  a  machine  that  can  perform  one 
load,  one  store,  and  initiate  a  7-stage  pipelined  floating  operation  in  one  instruction,  and  suppose  the  code 
we  want  to  execute  is: 

FOR  i  1  TO  n  DO 
A(i]  !-  A[i]+1.0; 

Assume  for  now  that  we  can  generate  the  addresses  for  the  loads  and  stores  in  parallel  with  the  rest  of  the 
computation;  the  specifics  of  this  topic  will  be  discussed  in  Section  7,  The  most  compact  instruction 
sequence  to  execute  a  single  iteration  of  this  loop  is  given  in  Figure  5-1.  The  operation  BLoop  1 
branches  back  to  label  I  if  there  are  more  iterations  to  execute.  The  schedule  is  sparse  due  to  the  heavy 
pipelining  in  the  data  path.  (For  machines  with  hardware  interlocK^,  'he  nop  instructions  are  used  only  at 
code  scheduling  time;  they  are  omitted  when  the  code  is  emitted.)  If  we  simply  iterate  this  schedule,  the 
throughput  of  the  loop  is  only  1  iteration  every  9  clock  ticks,  and  no  resources  are  used  more  than  l/9th  of 
the  time, 

1:  LD 
FADD 
nop 
nop 
nop 
nop 
nop 

iiOp 

ST  BLoop  1 

Figure  5-1:  Object  code  for  one  iteration  in  example  program 
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Techniques  such  as  trace  scheduling  [12]  depend  on  loop  unrolling  to  generate  enough  parallel 
instructions  to  scb»  uppose  the  loop  body  of  the  example  is  unrolled  9  times,  the  optimal  schedule 

of  the  body  of  A  loop  is  shown  in  Figure  5-2.  (This  instruction  sequence  assumes  that  the 

number  of  iterauons  is  divisible  by  9.)  Each  row  in  the  figure  corresponds  to  operations  in  an  instruction, 
and  each  column  corresponds  to  the  computation  of  one  iteration  of  the  loop  in  the  source  program. 
Unrolling  the  loop  9  times  improves  the  throughput  to  9  iterations  every  17  clocks.  From  the  figure,  it  is 
clear  that  unrolling  an  additional  iteration  will  only  lengdien  the  schedule  by  one  clock.  This  can  be  kept 
up  until  the  iterations  run  out  A  loop  unrolled  u  times  will  have  a  throughput  rate  of  u/{u+9)  iterations 
per  clock,  while  the  ideal  throughput  is  1  iteration  per  clock. 

1:  LD 
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FADD  LD 

FADD  LD 

FADD  LD 

FADD  LD 

FADD  LD 
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ST 

ST 

ST 

ST 

ST 

ST  Sloop  1 

Figure  5-2:  Optimal  schedule  for  nine  iterations 

Although  the  schedule  improves  as  we  unroll  more  iterations,  code  expansion  limits  the  degree  of 
unrolling.  Unrolling  can  therefore  overlap  only  a  small  fmite  number  of  iterations;  all  the  unrolled 
iterations  must  complete  before  the  program  branches  back  to  another  set  of  contiguous  iterations.  On  a 
vector  machine,  such  a  loop  maps  directly  into  a  vector  instruction;  a  vector  instruction  can  continually 
overlap  operations  from  successive  iterations  to  delivw  a  throughput  of  one  iteration  per  clock  cycle. 

Software  pipelining  can  achieve  the  same  kind  of  performance  obtained  with  vector  instructions  by 
continually  overlapping  operations  from  different  iterations  of  a  loop.  The  software  pipelined  program 
for  the  example  above  is  shown  in  Figure  5-3.  Code  generated  by  software  pipelining  is  compact  The 
code  in  the  figure  assumes  that  there  arc  at  least  nine  iterations  in  the  loop.  The  first  eight  instructions 
constitute  the  prolog  where  more  and  more  iterations  of  the  loop  start  to  execute.  Ilie  steady  state  is 
reached  after  eight  instructions,  and  is  repeated  until  all  iterations  have  been  initiated.  In  the  steady  state, 
nine  iterations  are  in  progress  at  the  same  time,  with  one  iteration  starting  up  and  one  finishing  off  every 
clock.  On  leaving  the  steady  state,  the  iterations  currently  in  progress  are  completed  in  the  epilog,  the 
10th  through  17th  instructions.  This  program  achieves  the  optimal  computation  time  by  executing  n 
iterations  in  n-f-S  clock  ticks,  where  n  is  the  uumber  of  iterations  in  the  loop. 

Software  pipelining  is  different  from  loop  unrolling  in  that  a  source  iteration  may  span  one  or  more 
iterations  in  the  object  code.  If  the  machine  contains  pipelined  functional  units,  the  pipeline  stages  need 
not  be  emptied  at  iteration  boundaries.  In  the  example  above,  seven  additions  initiated  in  seven  different 
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Figure  5-3:  Program  of  a  software  pipelined  loop 

iterations  execute  in  parallel.  The  hardware  pipelines  are  filled  and  drained  only  once  on  entering  and 
exiting  the  loop,  respectively.  Software  pipelining  is  especially  beneficial  for  machines  with  high  degrees 
of  parallelism  and  specialization.  The  results  are  that  optimal  throughput  can  be  achieved,  and  achieved 
with  an  extremely  compact  program. 


5.1.  The  Problem 

In  this  section,  we  first  concentrate  on  the  scheduling  of  loops  containing  a  single  basic  blocks. 
Extending  software  pipelining  to  other  loops  is  discussed  with  hierarchical  reduction  in  the  next  section. 
The  piimaiy  goal  of  software  pipelining  is  to  maximize  the  throughput  in  executing  the  iterations;  it  does 
not  matter  if  the  execution  time  of  individual  iterations  is  lengthened.  Its  secondary  goal  is  to  keep  the 
code  size  down.  In  other  words,  the  schedule  must  have  a  short  steady  state  so  that  it  can  be  captured  in  a 
relatively  succinct  code  sequence,  llie  problem  is  thus  formulated  as  finding  a  common  schedule  for  all 
iterations  of  the  source  loop,  such  that  successive  iterations  are  initiated  with  a  constant  interval,  and  the 
objective  is  to  minimize  this  interval.  In  the  example  above,  the  schedule  of  an  iteration  is  given  in 
Figure  5-1,  and  the  iteration  initiation  interval  is  one. 

Software  pipelining  was  criginally  derived  from  a  technique  for  scheduling  hardware  pipelines,  where 
the  problem  was  formulated  as  inserting  delays  between  hardware  units  to  increase  the  overall  throughput 
of  the  system  [20].  New  input  is  accepted  by  the  hardware  pipeline  at  regular  periodic  inter/als.  The 
software  analog  is  to  schedule  operations  within  an  iteration  such  that  the  iterations  can  be  pipelined  to 
yield  optimal  throughput 

Software  pipelining  has  been  used  in  compilers  for  several  different  architectures.  The  algorithm  was 
first  used  in  the  ESL  polycyclic  architecture  [22],  The  polycyclic  machine  uses  a  specialized  crossbar  to 
simplify  the  scheduling  problem  for  a  subset  of  loops  [23].  The  same  concept  is  also  implemented  in 
Cydrome’s  Cydra  5  [9],  Software  pipelining  is  also  used  in  the  compiler  for  the  FPS-164  machine  [25]. 
The  n'’S-164  does  not  have  any  specialized  support  for  software  pipelining,  and  software  heuristics  are 
used  to  schedule  the  loops.  We  improved  upon  the  FPS  heuristics,  especially  in  the  algorithm  for 
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scheduling  recurrences,  and  implemented  them  in  our  compilers  [7, 19]  for  the  Carnegie  Mellon’s  Warp 
and  iWarp  machines.  Eisenbeis  et  al.  applied  software  pipelining  to  the  problem  of  scheduling  vector 
instructions,  and  implemented  a  compiler  that  generates  software  pipelined  vector  code  for  the  Cray-2 
architecture  [10], 

Let  us  first  describe  some  of  the  fundamental  limits  in  scheduling  a  loop.  There  are  two  kinds  of 
constraints:  resource  and  precedence  constraints. 

Resource  Constraints.  Suppose  a  machine  has  m(r)  units  of  resource  r,  and  an  iteration  of  a  loop 
requires  n{r)  units  of  resource  r,  then  a  pipelined  loop  cannot  execute  faster  than  the  rate  of  at  most  one 
iteration  every 

in&x 

^  m(r) 

cycles.  This  equation  reconfirms  the  notion  that  it  is  harder  to  fully  utilize  highly  specialized  functional 
units  and  the  computation  rate  is  limited  by  the  resource  with  the  highest  demand. 

In  software  pipelining,  we  must  ensure  that  the  resource  commitment  in  each  clock  cycle  of  the  steady 
state  does  not  exceed  the  available  resources.  The  resource  usage  of  the  steady  state  can  be  represented 
by  a  modulo  resource  reservation  table  whose  ith  entry  contains  the  sum  of  the  resources  used  in  cycles 
i+2s, ...  of  the  schedule  of  an  iteration,  where  s  is  the  initiation  interval  of  the  loop. 

Precedence  constraints.  While  recurrences  limit  the  throughput  of  the  computation,  a  superscalar, 
unlike  a  vector  machine,  can  often  still  fmd  some  parallelism  in  such  loops.  Consider  the  following 
example; 

FOR  i  1  to  ICO  DO 
a  :  ■>  a  +  1.0; 

We  must  first  read  a  before  we  write  back  into  a  in  the  same  iteration,  which  in  turn  must  precede  the 
read  operation  in  the  next  iteration.  The  flow  graph  representing  the  above  example,  assuming  a  seven- 
staged  addition,  is  shown  in  Figure  5-4.  Each  edge  is  labeled  by  the  number  of  iterations  the  dependence 
crosses  and  the  delay  between  them.  As  shown  in  the  figure,  inter-iteration  data  dependences  may 
introduce  cycles  into  the  precedence  constraint  graph.  The  precedence  constraints  in  Figure  5-4  impose  a 
delay  of  9  clock  ticks  between  load  operations  from  consecutive  iterations.  That  is,  loops  cannot  execute 
at  a  rate  greater  than  one  iteration  every  9  clocks. 

We  defme  the  minimum  delay,  d,  and  minimum  iteration  difference,  p,  of  a  path  to  be  the  sum  of  the 
minimum  delays  and  minimum  iteration  differences  of  the  edges  in  the  path,  respectively.  If  we  let  c 
denote  a  cycle  in  the  graph,  the  rate  at  which  the  iterations  can  be  executed  is  one  iteration  every 

rd(c)1 


pic) 


cycles. 


The  maximum  oi  the  two  bounds  determined  by  resource  and  precedence  considerations  establishes  a 
lower  bound  on  the  initiation  interval.  Therefore,  a  schedule  that  pipelines  with  an  initiation  interval 
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Figure  5-4:  (a)  Delays  between  operations  from  two  iterations,  and  (b)  precedence  graph 

meeting  the  lower  bound  is  optimal.  Empirical  results  show  that  this  lower  bound  can  indeed  be  met  in 
many  cases  [18j. 

5.2.  The  Algorithm 

The  problem  of  finding  the  optimal  software  pipeline  schedule  is  NP-complete.  For  acyclic  graphs,  the 
scheduling  problem  is  tractable  if  operations  execute  in  unit  time  and  use  only  one  resource.  The 
polycyclic  architecture  [22]  and  the  Cydra  5  architecture  [9]  use  a  specialized,  rather  expensive  crossbar 
to  provide  exactly  that  property.  All  functional  units  of  a  polycyclic  machine  are  interconnected  through 
a  crossbar.  This  aossbar  has  storage  at  every  crosspoint  to  serve  as  a  dedicated  buffer  for  each  pair  of 
functional  units.  Therefore,  there  is  never  any  contention  in  reading  or  writing  data.  Each  operation  thus 
consumes  only  one  explicitly  scheduled  resource.  For  acyclic  graphs,  the  minimum  initiation  interval  is 
given  by  the  bound  discussed  above  and  an  optimal  schedule  can  easily  be  found.  However,  the  problem 
remains  NP-complete  for  cyclic  graphs  even  if  operations  use  one  unit  of  resource  and  execute  in  one  unit 
time. 

Without  the  sptecialized  hardware  to  support  software  pipelining,  both  the  FPS  and  the  Warp/iWarp 
compilers  use  software  heuristics.  The  algorithms  used  for  scheduling  acyclic  graphs  are  similar,  but  the 
cyclic  graph  scheduling  algorithm  is  significantly  improved  in  our  Warp/iWarp  compilers.  The  algorithm 
for  acyclic  graphs  is  as  follows:  First,  establish  a  lower  and  an  iqjper  bound  on  the  initiation  interval. 
The  lower  bound  is  calculated  from  the  resource  and  precedence  constraints;  the  upper  bound  can  be 
found  by  the  schedule  of  a  single  loop  iteration.  Next,  find  the  smallest  initiation  interval.  Simple  linear 
search  is  used  in  our  Warp/1  Warp  compiler  because  empirical  results  show  that  a  schedule  meeting  the 
lower  bound  can  often  be  found.  The  algorithm  first  sets  the  target  of  the  initiation  interval  to  be  the 
lower  bound  value,  and  attempts  to  find  a  pipelinable  schedule  for  the  target  initiation  interval  using  the 
method  described  below.  If  the  attempt  fails,  this  process  is  reiterated  by  increasing  the  target  initiation 
interval  by  one  clock  tick  at  a  time. 


The  basic  algorithm  used  to  find  a  software  pipelinable  schedule  for  a  target  initiation  is  list  scheduling. 
In  list  scheduling,  the  precedence  constraints  are  applied  first  to  determine  the  carlic,st  slot  in  which  au 
operation  can  be  scheduled.  The  scheduler  then  goes  on  to  try  to  satisfy  the  resource  constraints;  the 


11 


modulo  resource  reservation  table  defined  above  is  used  to  determine  if  there  is  a  resource  conflict  The 
scheduler  tries  to  schedule  the  operation  in  successive  time  slots  until  one  that  can  accommodate  its 
resource  requirement  is  found.  If  s  is  the  target  initiation  interval,  and  s  attempts  to  satisfy  the  resource 
constraints  fail,  by  the  definition  of  modulo  resource  usage,  this  operation  cannot  fit  into  the  schedule 
built  so  far.  When  this  happens,  the  attempt  to  fmd  a  schedule  for  the  given  initiation  interval  is  aborted 
and  the  scheduling  process  is  repeated  with  a  greater  interval  value. 

As  in  the  case  of  acyclic  graphs,  the  main  scheduling  step  for  cyclic  graphs  is  iterative.  For  each  target 
initiation  interval,  the  strongly  connected  components  are  first  scheduled  individually.  The  original  graph 
is  then  reduced  by  representing  each  strongly  connected  component  as  a  single  vertex;  the  resource  usage 
of  the  vertex  represents  the  aggregate  resource  usage  of  its  components,  and  edges  connecting  nodes  from 
different  connected  components  are  represented  by  edges  between  the  corresponding  vertices.  This 
reduced  graph  is  acyclic,  and  the  acyclic  graph  scheduling  algorithm  can  then  be  applied. 

Two  main  concepts  are  used  in  the  algorithm  for  scheduling  the  strongly  connected  components.  First, 
the  precedence  constraints  are  preprocessed  so  that  the  scheduler  can  easily  determine  the  legal  time  span 
in  which  any  node  can  be  scheduled.  Second,  the  order  in  which  the  instructions  are  scheduled  is 
designed  such  that  when  the  target  initiation  interval  value  is  increased,  the  chance  for  success  also 
improves.  This  is  important  because  it  would  be  futile  if  the  scheduling  algorithm  simply  retried  the  same 
schedule  that  failed. 

A  large  set  of  evaluation  data  on  the  Warp/iWarp  machine  indicates  that  provably  optimal  schedules 
can  often  be  found  [18].  This  shows  that  software  pipelining  does  not  require  expensive  hardware 
support  The  code  generated  is  compact  the  body  of  a  software  pipelined  loop  is  even  shorter  than  the 
unoptimized  code. 

6.  Hierarchical  Reduction 

The  hierarchical  reduction  technique  is  designed  to  make  software  pipelining  applicable  to  all 
innermost  loops,  including  those  containing  conditional  statements.  The  proposed  approach  schedules  the 
program  hierarchically,  starting  with  the  innermost  control  constructs.  As  each  coiutruct  is  scheduled,  the 
entire  construct  is  reduced  to  a  simple  node  representing  all  the  scheduling  constraints  of  its  components 
with  other  constructs.  This  node  can  then  be  scheduled  just  like  a  simple  node  within  the  surrounding 
control  construct  The  scheduling  process  is  complete  when  an  entire  program  is  reduced  to  a  single 
node. 

The  use  of  the  construct  structure  exploits  high-level  control  dependence  knowledge  [11]  to  increase 
the  opportimity  for  code  motion.  As  an  example  of  the  kind  of  code  motions  achievable  with  tliis 
technique,  consider  the  following  program: 

FOB  i  0  to  n  DO 

BEGIN 

statement  1; 

IF  c  THEN  statement  2  ELSE  atacement  3; 

statement  4; 

END 
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Altbough  statement  4  comes  after  the  conditional  statement,  it  is  not  control  dependent  upon  the  result  of 
the  condition  c.  Once  the  program  decides  to  execute  another  iteration,  it  can  execute  statements  1  and  4 
in  any  order  that  satisfies  the  data  dependences.  For  example,  an  operation  in  statement  4  can  be  executed 
before  the  conditional  statement.  The  hierarchical  reduction  algorithm  first  schedules  the  then  and 
ELSE  parts  of  the  conditional  statement,  and  represents  the  entire  construct  with  a  single  node  that 
inherits  the  union  of  the  scheduling  constraints  for  each  of  the  two  parts  of  the  conditional  statement.  The 
entire  construct  is  then  scheduled  with  statements  1  and  4.  Operations  corresponding  to  statements  1  and 
4  may  be  reordered,  they  may  also  execute  in  parallel  with  the  THEN  and  ELSE  components  of  the 
conditional  statement.  At  code  emission  time,  any  code  scheduled  in  parallel  with  the  conditional 
statement  is  duplicated  in  both  the  THEN  and  ELSE  parts. 

This  control  dependence  knowledge  when  combined  with  software  pipelining  can  produce  surprisingly 
efficient  code.  The  loop  termination  test  for  the  next  iteration  can  be  performed  immediately  after  the 
decision  to  execute  the  current  iteration.  This  test  can  move  past  ail  the  conditional  branches  in  the  body 
of  the  loop.  In  this  way,  hierarchical  reduction  exposes  many  more  parallel  operations  for  scheduling. 

Hierarchical  reduction  also  minimizes  the  penalty  of  short  vectors,  or  loops  with  small  number  of 
iterations.  The  prolog  and  epilog  of  a  loop  can  be  overlapped  with  scalar  operations  outside  the  loop;  the 
epilog  of  a  loop  can  be  overlapped  with  the  prolog  of  the  next  loop;  lastly,  software  pipelining  can  be 
applied  even  to  an  outer  loop.  In  summary,  hierarchical  reduction  makes  it  possible  to  exploit  parallelism 
in  a  much  larger  set  of  applications.  It  allows  loops  containing  conditional  statements  to  be  software 
pipelined,  and  it  finds  parallelism  within  loop  bodies  that  are  too  long  to  pipeline. 

7.  Modulo  Variable  Expansion 

If  traditional  register  assignment  were  performed  before  code  scheduling,  then  the  reuse  of  registers  for 
different  variables  would  significantly  reduce  the  potential  parallelism  in  the  code.  This  is  because  the 
objective  of  register  assignment  is  to  use  as  few  registers  as  possible.  A  register  is  recycled  in  the  shortest 
amount  of  time,  thus  creating  many  more  data  dependences  that  need  to  be  observed.  Cooperation  is 
therefore  required  between  code  scheduling  and  register  assignment  in  a  superscalar  compiler.  Proposed 
strategies  include  combining  register  assignment  with  scheduling  [15],  and  postponing  register 
assignment  until  after  scheduling  [18].  The  latter  approach  simplifies  the  compiler  design  by  separating 
scheduling  and  register  assignment  into  two  different  phases.  The  drawback,  however,  is  that  there  may 
not  be  enough  registers  and  code  needs  to  be  inserted  to  spill  values  to  memory. 

There  is  one  form  of  register  reuse  that  can  greatly  inhibit  parallelization,  and  that  is  the  use  of  the  same 
register  for  the  same  variable  in  different  iterations  of  a  loop.  To  illustrate  this  point,  let  us  use  the  same 
example; 

FOR  i:«  0  TO  n  DO 
A[i]  Aii)+1.0; 

For  the  sake  of  simplicity,  here  we  assume  that  a  floating-point  addition  takes  only  two  clocks.  7116 
object  code  for  one  iteration,  complete  with  register  assignment,  is  as  follows. 
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#  R1  preloaded  with  address  of  A 

#  rR7  preloaded  with  1.0 

LD  FRl,  (Rl) 

FADD  FR1,FR7 
nop 

ST  FRl, (Rl) 

ADD  R1,R1,4 

The  register  assignment  prevents  this  vectorizable  loop  from  executing  in  parallel.  The  register  FKl 
cannot  be  loaded  with  the  next  input  until  after  its  last  use  in  the  previous  iteration.  Similarly,  the  register 
Rl  cannot  be  incremented  until  the  last  store  operation  is  performed.  Anti-dependences  force  the  write 
operations  to  follow  all  the  read  operations  of  the  old  values;  consequently,  the  computation  must  execute 
serially. 

Modulo  variable  expansion  is  a  register  assignment  technique  that  eliminates  these  anti-dependences. 
The  following  is  the  result  of  applying  the  combination  of  software  pipelining  and  modulo  variable 
expansion  to  the  example  above. 

*  Rl  preloaded  with  address  of  A 

#  FR7  preloaded  with  1.0 

LD  FRl, (Rl) 

FADD  FR1,FR7  ADD  R2,R1,4 

1:  LD  FR2, (R2) 

ST  FRl,  (Rl)  FADD  FR2,FR7  ADD  R1,R2,4 

LD  FRl,  (Rl) 

ST  rR2, (R2)  FADD  FR1,FR7  ADD  R2,R1,4  BLoop  1 

nop 

ST  FRl, (Rl) 

To  eliminate  the  anti-dependence  constraint,  the  second  iteration  uses  a  different  set  of  registers,  K2 
and  FR2,  and  can  thus  overlap  with  the  fusL  The  third  iteration,  on  the  other  hand,  can  reuse  the  set  in 
the  first  iteration.  In  fact,  every  other  iteration  can  use  the  same  set  of  registers,  making  the  code  identical 
every  two  consecutive  iterations.  The  length  of  the  steady  state  is  just  twice  the  initiation  interval  and  the 
loop  body  is  therefore  still  very  compact 

We  call  this  optimization  of  assigning  several  registers  to  a  loop  variable  modulo  variable  expansion. 
In  vectorizing  compilers,  scalar  variables  are  expanded  into  arrays  so  that  each  iteration  refers  to  a 
different  array  element,  making  the  loop  vectorizable.  Modulo  variable  expansion  takes  advantage  of  the 
flexibility  of  superscalar  machines,  and  reduces  the  number  of  registers  allocated  to  a  variable  by  reusing 
the  same  location  in  non-overlapping  iterations. 

A  tradeoff  can  be  made  between  the  degree  of  loop  uiuolling  and  the  number  of  registers  used.  For  the 
Warp  machine  which  contains  a  relatively  large  number  of  registers,  minimi ^.ing  the  degree  of  unrolling  is 
a  better  choice  [19J.  Eisenbeis  et  al.,  on  the  other  band,  minimizes  register  usage  because  their  target 
machine,  Cray-2,  has  only  eight  vector  registers  [10). 
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8.  Performance  of  Superscalar  Machines 

Having  functional  units  that  can  be  explicitly  controlled  by  software,  a  superscalar  processor  is  more 
versatile  than  a  vector  machine.  The  parallelism  on  a  vector  machine  is  restricted  to  the  set  of  vector 
instructions,  and,  if  chaining  is  supported,  parallelism  between  vector  instructions  that  use  different 
functional  units.  Using  software  pipelining  to  schedule  a  superscalar  with  similar  functional  units,  a 
simple  loop  that  corresponds  to  a  vector  instruction,  such  as  the  pairwise  additions  of  two  vectors,  can 
execute  at  the  same  throughput  rate  as  a  vector  instruction.  In  addition,  a  superscalar  can  And  parallelism 
in  complex  loops.  Loops  do  not  need  to  be  decomposed  into  simple  vector  instructions  which  require 
partial  expressions  be  buffered  in  vector  registers.  More  importantly,  a  superscalar  can  And  parallelism  in 
loops  with  recurrences  and  conditional  statements. 

The  ability  of  a  superscalar  machine  to  execute  custom  generated  parallel  code  eliminates  the  need  for 
buffering  vectors  of  partial  results.  For  example,  a  vectorizing  compiler  must  decompose  the  loop  in 
Figure  8-1  (a)  into  two,  each  corresponding  to  a  vector-add  instruction  (Figure  8-1  (b)).  The  partial  sums 
must  be  buffered  in  a  vector  register.  On  a  superscalar  machine,  the  partial  results  ran  be  operated  on  as 
soon  as  they  are  generated,  as  illustrated  in  Figure  8-1  (c).  This  reduces  the  number  o(  registers  needed 
and  possibly  memory  accesses. 

(a)  FOR  i  0  TO  n  DO  BEGIN 

c[il  :■  atil+t>[il+c[i]  ; 

END; 

(b)  FOR  i  :«■  0  TO  n  DO  BEGIN 

t[il  a[i]+b[i]; 

END; 

FOR  i  0  TO  n  DO  BEGIN 
c [i]  t (il+c [i] ; 

END; 

(C)  FOR  i  0  TO  n  DO  BEGIN 

t  a [i]+b[i] ; 
c [i]  ;»  t+cii] ; 

END; 

Figure  8-1:  Reduced  register  requirement  in  superscalar  machines 
(a)  source  program,  (b)  vector  code,  (c)  scalar  code. 

A  recurrence  does  not  necessarily  mean  serial  execution  for  superscalar  machines.  As  long  as  there  are 
other  operations  that  can  execute  in  parallel  with  the  recurrence  computation,  a  high  computation  rate  can 
still  be  obtained  using  software  pipelining.  The  degree  of  parallelism  in  a  vectorized  loop  is  of  the  order 
of  the  number  of  iterations  in  the  loop.  A  recurrence,  however,  limits  the  degree  of  parallelism  by  the 
ratio  of  independent  operations  to  the  length  of  the  cyclic  dependence.  This  limited  form  of  parallelism 
can  be  exploited  in  superscalar  processes  because  of  their  unique  zero  synchronization  oveitead.  The 
compiler  strategy  for  superscalar  machines  is  different  from  that  for  vector  machines.  A  vectoriziiig 
compiler  tries  to  decompose  a  loop  into  smaller  loops,  separating  recurrences  from  vectorlzable  loops.  A 
superscalar  compiler,  on  the  other  hand,  tries  to  jam  independent  loops  together.  The  vectorlzable  loop 
may  be  executed  ou  the  idle  functional  units  while  the  program  computes  a  recurrence! 

In  addition  to  recurrences,  hierarchical  reduction  allows  us  to  And  parallelism  even  in  loops  with 
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conditional  statements.  Hierarchical  reduction  also  reduces  the  penalty  typically  associated  with  short 
vectors.  In  a  superscalar  machine,  the  scalar  operations  can  be  overlapped  with  the  prolog  and  epilog  of  a 
software  pipelined  loop.  This  easy  integration  of  scalar  and  vector  operations  makes  the  performance  of 
the  system  less  sensitive  to  the  size  of  the  data.  Moreover,  software  pipelining  can  be  applied  even  to 
outer  loops,  making  the  advantages  of  software  pipelining  applicable  even  for  programs  containing  short 
innermost  loops. 

The  instruction  scheduling  and  register  assignment  techniques  have  been  implemented  in  the  compilers 
for  the  Warp  and  iWarp  machines,  and  have  been  extensively  evaluated  [18].  The  Warp  processor  has  a 
peak  computation  rate  of  10  MFLOPS,  an  impressive  performance  for  a  machine  built  in  1986.  This  peak 
computation  rate  is  achieved  by  a  high  degree  of  parallelism  and  specialization.  In  a  single  instruction,  a 
Warp  processor  can  perform  one  7-stagsd  floating-point  addition,  one  7-staged  floating-point 
multiplication,  one  memory  read,  one  memory  write,  t«'0  integer  operations,  and  one  branch  operation. 

We  have  analyzed  the  performance  of  a  set  of  seventy-two  programs  and  the  Livermore  kernels.  The 
performance  of  most  of  the  programs  fall  between  the  1  to  4  MFLOPS  range,  with  a  2.8  MFLOPS 
average.  This  utilization  of  resources  is  higher  than  that  typically  observed  in  supercomputers. 
Performance  analysis  of  the  software  pipeliner  shows  that  the  scheduler  is  successful  in  exploiting 
parallelism  once  the  parallelism  is  detected.  About  three-quarters  of  over  one  hundred  loops  pipelined  are 
provably  optimal.  When  compared  with  code  generated  by  a  compiler  that  fmds  parallelism  only  within  a 
basic  block,  most  of  the  loops  achieve  a  speed  up  of  between  two  and  six. 

9.  Conclusions 

This  paper  presents  an  overview  of  compiler  optimizations  that  exploit  parallelism  in  a  superscalar 
machine.  High-level  loop  transformations  improve  data  locality  and  place  parallelism  in  the  innermost 
loops,  in  preparation  for  instruction  level  optimizations.  Software  pipelining  is  the  basic  technique  that 
finds  parallelism  across  iterations  in  inner  loops.  Hierarchical  reduction  helps  deliver  a  high  level  of 
perfonnance  for  a  broader  range  of  applications,  for  example,  by  permitting  software  pipelining  to  be 
used  even  for  loops  with  conditional  statements.  And  lastly,  modulo  variable  expansion  eliminates 
dependence  constraints  due  to  reuse  of  registers  between  iterations. 

The  superscalar  architecture  is  a  promising  alternative  to  vector  machines.  We  now  have  compiler 
techniques  that  can  generate  highly  efficient  parallel  code  directly  from  user  programs.  Given  the  same 
hardware  functional  units,  a  superscalar  machine  delivers  the  same  performance  of  a  vector  machine  if  the 
program  is  vectorizable.  And  the  superscalar  machine  is  decidably  superior  to  vector  machines  when  the 
computation  contains  recurrences  and  conditional  statements.  A  superscalar  does  not  exhibit  a  dichotomy 
in  performance  depending  on  whether  the  code  is  vectorizable  or  not. 

G>mpiler  optimizations  require  programs  to  be  analyzable  statically.  A  superscalar  architecture  has  an 
organization  that  is  more  easily  enhanced  to  handle  programs  that  are  not  amenable  to  static  analysis,  By 
a  judicious  use  of  hardware  to  provide  dynamic  information  to  cooperating  software,  processors  that 
deliver  a  consistent  high-performance  through  instruction  level  parallelism  are  possible. 
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